Skip to content

JIT doesn't optimize away redundant checks for Span.Length #9608

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
GrabYourPitchforks opened this issue Jan 26, 2018 · 4 comments
Closed

JIT doesn't optimize away redundant checks for Span.Length #9608

GrabYourPitchforks opened this issue Jan 26, 2018 · 4 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions JitUntriaged CLR JIT issues needing additional triage optimization tenet-performance Performance related issue
Milestone

Comments

@GrabYourPitchforks
Copy link
Member

In scenarios where a caller performs a comparison and where the callee (which has been inlined into the caller) performs that same comparison, the JIT won't consolidate the redundant checks into a single check. See examples below.

[MethodImpl(MethodImplOptions.NoInlining)]
private static void DoSomething(Span<byte> dest, Span<byte> src)
{
    if ((uint)src.Length <= (uint)dest.Length)
    {
        src.CopyTo(dest);
    }
    else
    {
        Console.WriteLine("Can't copy, dest too short.");
    }
}
; ConsoleAppTestCopyPerf!Program.DoSomething(System.Span`1<Byte>, System.Span`1<Byte>):
00007ff7`d6541490 4883ec28        sub     rsp,28h
00007ff7`d6541494 4c8b02          mov     r8,qword ptr [rdx] ; this begins our src.Length <= dest.Length check
00007ff7`d6541497 8b5208          mov     edx,dword ptr [rdx+8]
00007ff7`d654149a 3b5108          cmp     edx,dword ptr [rcx+8] ; compare src.Length to dest.Length
00007ff7`d654149d 7721            ja      ConsoleAppTestCopyPerf!Program.DoSomething(System.Span`1<Byte>, System.Span`1<Byte>)+0x30 (00007ff7`d65414c0)
00007ff7`d654149f 488b01          mov     rax,qword ptr [rcx] ; Span.CopyTo is inlined into the caller, here's its length check
00007ff7`d65414a2 8b4908          mov     ecx,dword ptr [rcx+8]
00007ff7`d65414a5 3bd1            cmp     edx,ecx ; compare src.Length to dest.Length again
00007ff7`d65414a7 772f            ja      ConsoleAppTestCopyPerf!Program.DoSomething(System.Span`1<Byte>, System.Span`1<Byte>)+0x48 (00007ff7`d65414d8)
00007ff7`d65414a9 4c63ca          movsxd  r9,edx
00007ff7`d65414ac 488bc8          mov     rcx,rax
00007ff7`d65414af 498bd0          mov     rdx,r8
00007ff7`d65414b2 4d8bc1          mov     r8,r9
00007ff7`d65414b5 e8867ae855      call    System_Private_CoreLib!System.Buffer.Memmove(System.ByReference`1<Byte>, System.ByReference`1<Byte>, UInt64) (00007ff8`2c3c8f40)
00007ff7`d65414ba 90              nop
00007ff7`d65414bb 4883c428        add     rsp,28h
00007ff7`d65414bf c3              ret
00007ff7`d65414c0 48b96830d13354010000 mov rcx,15433D13068h
00007ff7`d65414ca 488b09          mov     rcx,qword ptr [rcx]
00007ff7`d65414cd e856f6ffff      call    System.Console.WriteLine(System.String) (00007ff7`d6540b28)
00007ff7`d65414d2 90              nop
00007ff7`d65414d3 4883c428        add     rsp,28h
00007ff7`d65414d7 c3              ret
; JIT shouldn't have introduced the two instructions below
00007ff7`d65414d8 e8e3caec55      call    System_Private_CoreLib!System.ThrowHelper.ThrowArgumentException_DestinationTooShort() (00007ff8`2c40dfc0)
00007ff7`d65414dd cc              int     3

I thought maybe it had something to do with querying dest's and src's length being an indirect lookup ([rdx+8] or [rcx+8]), so I also tried capturing local copies to guarantee that everything is enregistered.

[MethodImpl(MethodImplOptions.NoInlining)]
private static void DoSomething(Span<byte> dest, Span<byte> src)
{
    var d = dest;
    var s = src;
    if ((uint)s.Length <= (uint)d.Length)
    {
        s.CopyTo(d);
    }
    else
    {
        Console.WriteLine("Can't copy, dest too short.");
    }
}
; ConsoleAppTestCopyPerf!Program.DoSomething(System.Span`1<Byte>, System.Span`1<Byte>):
00007ff7`d6561490 4883ec28        sub     rsp,28h
00007ff7`d6561494 4c8b01          mov     r8,qword ptr [rcx]
00007ff7`d6561497 8b4908          mov     ecx,dword ptr [rcx+8]
00007ff7`d656149a 488b02          mov     rax,qword ptr [rdx]
00007ff7`d656149d 8b5208          mov     edx,dword ptr [rdx+8]
00007ff7`d65614a0 3bd1            cmp     edx,ecx
00007ff7`d65614a2 771b            ja      ConsoleAppTestCopyPerf!Program.DoSomething(System.Span`1<Byte>, System.Span`1<Byte>)+0x2f (00007ff7`d65614bf)
00007ff7`d65614a4 3bd1            cmp     edx,ecx ; I don't think edx / ecx changed in the past one instruction :)
00007ff7`d65614a6 772f            ja      ConsoleAppTestCopyPerf!Program.DoSomething(System.Span`1<Byte>, System.Span`1<Byte>)+0x47 (00007ff7`d65614d7)
00007ff7`d65614a8 4c63ca          movsxd  r9,edx
00007ff7`d65614ab 498bc8          mov     rcx,r8
00007ff7`d65614ae 488bd0          mov     rdx,rax
00007ff7`d65614b1 4d8bc1          mov     r8,r9
00007ff7`d65614b4 e8877ae655      call    System_Private_CoreLib!System.Buffer.Memmove(System.ByReference`1<Byte>, System.ByReference`1<Byte>, UInt64) (00007ff8`2c3c8f40)
00007ff7`d65614b9 90              nop
00007ff7`d65614ba 4883c428        add     rsp,28h
00007ff7`d65614be c3              ret
00007ff7`d65614bf 48b968306e1ba6010000 mov rcx,1A61B6E3068h
00007ff7`d65614c9 488b09          mov     rcx,qword ptr [rcx]
00007ff7`d65614cc e857f6ffff      call    System.Console.WriteLine(System.String) (00007ff7`d6560b28)
00007ff7`d65614d1 90              nop
00007ff7`d65614d2 4883c428        add     rsp,28h
00007ff7`d65614d6 c3              ret
00007ff7`d65614d7 e8e4caea55      call    System_Private_CoreLib!System.ThrowHelper.ThrowArgumentException_DestinationTooShort() (00007ff8`2c40dfc0)
00007ff7`d65614dc cc              int     3

category:cq
theme:bounds-checks
skill-level:expert
cost:small

@RussKeldorph
Copy link
Contributor

@dotnet/jit-contrib

@erozenfeld
Copy link
Member

I investigated this (the second case with local copies). When I tried to repro the issue, I wasn't getting the call to Span.CopyTo inlined, the jit thought the inline was unprofitable. So I added [AggressiveInlining] to Span.CopyTo and then got the same codegen as reported above.

The reason the jit doesn't remove the redundant check is that the jit's assertion propagation phase only creates assertions for GT_EQ and GT_NE nodes used in jumps and in this case we have a GT_GT node.

Here is the relevant comment from the code:
https://github.com/dotnet/coreclr/blob/8dde886767682feac4b5414366dfae7be3c08412/src/jit/assertionprop.cpp#L1930

// Find assertion kind.
    switch (relop->gtOper)
    {
        case GT_EQ:
            assertionKind = OAK_EQUAL;
            break;
        case GT_NE:
            assertionKind = OAK_NOT_EQUAL;
            break;
        default:
            // TODO-CQ: add other relop operands. Disabled for now to measure perf
            // and not occupy assertion table slots. We'll add them when used.
            return NO_ASSERTION_INDEX;
    }

I prototyped an implementation that works for GT_GT nodes:
dotnet/coreclr@master...erozenfeld:AssertionProp_GT_GT

With that change the second comparison, the second jump, and the code at the target of the second jumps are eliminated:

D:\coreclr1>.\bin\tests\Windows_NT.x64.Checked\Tests\Core_Root\CoreRun.exe c:\temp\spanexample.exe
; Assembly listing for method Test:DoSomething(struct,struct)
; Emitting BLENDED_CODE for X64 CPU with AVX
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  4,  8   )   byref  ->  rcx
;  V01 arg1         [V01,T01] (  4,  8   )   byref  ->  rdx
;* V02 loc0         [V02    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op
;* V03 loc1         [V03    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op
;* V04 tmp0         [V04    ] (  0,  0   )     int  ->  zero-ref
;* V05 tmp1         [V05    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op
;* V06 tmp2         [V06    ] (  0,  0   )     int  ->  zero-ref
;* V07 tmp3         [V07    ] (  0,  0   )  struct ( 8) zero-ref    ld-addr-op
;  V08 tmp4         [V08,T04] (  2,  2   )   byref  ->   r8
;* V09 tmp5         [V09    ] (  0,  0   )  struct ( 8) zero-ref    ld-addr-op
;  V10 tmp6         [V10,T05] (  2,  2   )   byref  ->  rax
;  V11 tmp7         [V11,T06] (  2,  2   )    long  ->   r9
;* V12 tmp8         [V12    ] (  0,  0   )   byref  ->  zero-ref
;* V13 tmp9         [V13    ] (  0,  0   )   byref  ->  zero-ref
;* V14 tmp10        [V14    ] (  0,  0   )  struct ( 8) zero-ref
;* V15 tmp11        [V15    ] (  0,  0   )  struct ( 8) zero-ref
;* V16 tmp12        [V16    ] (  0,  0   )   byref  ->  zero-ref    V30._pointer(offs=0x00) P-INDEP
;* V17 tmp13        [V17    ] (  0,  0   )     int  ->  zero-ref    V30._length(offs=0x08) P-INDEP
;* V18 tmp14        [V18    ] (  0,  0   )   byref  ->  zero-ref    V31._pointer(offs=0x00) P-INDEP
;* V19 tmp15        [V19    ] (  0,  0   )     int  ->  zero-ref    V31._length(offs=0x08) P-INDEP
;  V20 tmp16        [V20,T07] (  2,  1.50)   byref  ->   r8         V02._pointer(offs=0x00) P-INDEP
;  V21 tmp17        [V21,T02] (  2,  2   )     int  ->  rcx         V02._length(offs=0x08) P-INDEP
;  V22 tmp18        [V22,T08] (  2,  1.50)   byref  ->  rax         V03._pointer(offs=0x00) P-INDEP
;  V23 tmp19        [V23,T03] (  3,  2.50)     int  ->  rdx         V03._length(offs=0x08) P-INDEP
;  V24 tmp20        [V24,T09] (  2,  1   )   byref  ->   r8         V05._pointer(offs=0x00) P-INDEP
;* V25 tmp21        [V25,T12] (  0,  0   )     int  ->  zero-ref    V05._length(offs=0x08) P-INDEP
;* V26 tmp22        [V26    ] (  0,  0   )   byref  ->  zero-ref    V07._value(offs=0x00) P-INDEP
;* V27 tmp23        [V27    ] (  0,  0   )   byref  ->  zero-ref    V09._value(offs=0x00) P-INDEP
;  V28 tmp24        [V28,T10] (  2,  1   )   byref  ->   r8         V14._value(offs=0x00) P-INDEP
;  V29 tmp25        [V29,T11] (  2,  1   )   byref  ->  rax         V15._value(offs=0x00) P-INDEP
;* V30 tmp26        [V30    ] (  0,  0   )  struct (16) zero-ref
;* V31 tmp27        [V31    ] (  0,  0   )  struct (16) zero-ref
;  V32 OutArgs      [V32    ] (  1,  1   )  lclBlk (32) [rsp+0x00]
;
; Lcl frame size = 40

G_M10458_IG01:
       4883EC28             sub      rsp, 40

G_M10458_IG02:
       4C8B01               mov      r8, bword ptr [rcx]
       8B4908               mov      ecx, dword ptr [rcx+8]
       488B02               mov      rax, bword ptr [rdx]
       8B5208               mov      edx, dword ptr [rdx+8]
       3BD1                 cmp      edx, ecx
       7717                 ja       SHORT G_M10458_IG04
       4C63CA               movsxd   r9, edx
       498BC8               mov      rcx, r8
       488BD0               mov      rdx, rax
       4D8BC1               mov      r8, r9
       E8CBC39455           call     System.Buffer:Memmove(struct,struct,long)
       90                   nop

G_M10458_IG03:
       4883C428             add      rsp, 40
       C3                   ret

G_M10458_IG04:
       48B96830942A52020000 mov      rcx, 0x2522A943068
       488B09               mov      rcx, gword ptr [rcx]
       E89BFBFFFF           call     System.Console:WriteLine(ref)
       90                   nop

G_M10458_IG05:
       4883C428             add      rsp, 40
       C3                   ret

; Total bytes of code 67, prolog size 4 for method Test:DoSomething(struct,struct)
; ============================================================

Getting this change checked in will be somewhat challenging because we have a limit on the number of assertions generated per function (256) so generating these new assertions will likely bump existing assertions from some functions leading to CQ regressions. Removing this limit is something we are considering but it may have throughput implications.

@mikedn
Copy link
Contributor

mikedn commented Feb 2, 2018

Getting this change checked in will be somewhat challenging because we have a limit on the number of assertions generated per function (256) so generating these new assertions will likely bump existing assertions from some functions leading to CQ regressions. Removing this limit is something we are considering but it may have throughput implications.

Note that the specific condition if ((uint)src.Length <= (uint)dest.Length) already generates an assertion that's used for range check elimination. That's probably both good and bad. It's good because the assertion is already there, it's bad because it looks different from other relop assertions and thus it may be more cumbersome to use to eliminate redundant conditions.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@BruceForstall BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020
@jakobbotsch
Copy link
Member

Codegen on main:

; Assembly listing for method MustHaveValueBenchmark.C:DoSomething(System.Span`1[ubyte],System.Span`1[ubyte])
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 2 inlinees with PGO data; 4 single block inlinees; 1 inlinees without PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T01] (  4,  7   )   byref  ->  rcx         ld-addr-op single-def
;  V01 arg1         [V01,T00] (  4,  8   )   byref  ->  rdx         ld-addr-op single-def
;  V02 OutArgs      [V02    ] (  1,  1   )  lclBlk (32) [rsp+00H]   "OutgoingArgSpace"
;* V03 tmp1         [V03    ] (  0,  0   )     int  ->  zero-ref    "impAppendStmt"
;* V04 tmp2         [V04    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op "Inlining Arg"
;* V05 tmp3         [V05    ] (  0,  0   )     int  ->  zero-ref    "impAppendStmt"
;* V06 tmp4         [V06    ] (  0,  0   )   byref  ->  zero-ref    single-def "Inlining Arg"
;* V07 tmp5         [V07    ] (  0,  0   )   byref  ->  zero-ref    single-def "Inlining Arg"
;  V08 tmp6         [V08,T04] (  2,  2   )    long  ->  rax         "Inlining Arg"
;* V09 tmp7         [V09    ] (  0,  0   )   byref  ->  zero-ref    V15._reference(offs=0x00) P-INDEP "field V00._reference (fldOffset=0x0)"
;* V10 tmp8         [V10    ] (  0,  0   )     int  ->  zero-ref    V15._length(offs=0x08) P-INDEP "field V00._length (fldOffset=0x8)"
;  V11 tmp9         [V11,T05] (  2,  1.50)   byref  ->   r8         single-def V16._reference(offs=0x00) P-INDEP "field V01._reference (fldOffset=0x0)"
;  V12 tmp10        [V12,T02] (  3,  2.50)     int  ->  rdx         V16._length(offs=0x08) P-INDEP "field V01._length (fldOffset=0x8)"
;  V13 tmp11        [V13,T06] (  2,  1   )   byref  ->  rcx         single-def V04._reference(offs=0x00) P-INDEP "field V04._reference (fldOffset=0x0)"
;* V14 tmp12        [V14,T07] (  0,  0   )     int  ->  zero-ref    V04._length(offs=0x08) P-INDEP "field V04._length (fldOffset=0x8)"
;* V15 tmp13        [V15    ] (  0,  0   )  struct (16) zero-ref    "Promoted implicit byref"
;* V16 tmp14        [V16    ] (  0,  0   )  struct (16) zero-ref    "Promoted implicit byref"
;  V17 cse0         [V17,T03] (  2,  2   )     int  ->  rax         "CSE - aggressive"
;
; Lcl frame size = 40

G_M6266_IG01:
       sub      rsp, 40
						;; size=4 bbWeight=1    PerfScore 0.25
G_M6266_IG02:
       mov      r8, bword ptr [rdx]
       mov      edx, dword ptr [rdx+08H]
       mov      eax, dword ptr [rcx+08H]
       cmp      edx, eax
       ja       SHORT G_M6266_IG05
						;; size=13 bbWeight=1    PerfScore 7.25
G_M6266_IG03:
       mov      rcx, bword ptr [rcx]
       mov      eax, edx
       mov      rdx, r8
       mov      r8, rax
       call     [System.Buffer:Memmove(byref,byref,ulong)]
       nop      
						;; size=18 bbWeight=0.50 PerfScore 3.00
G_M6266_IG04:
       add      rsp, 40
       ret      
						;; size=5 bbWeight=0.50 PerfScore 0.62
G_M6266_IG05:
       mov      rcx, 0xD1FFAB1E      ; 'Can't copy, dest too short.'

       call     [System.Console:WriteLine(System.String)]
       nop      
						;; size=17 bbWeight=0.50 PerfScore 1.75
G_M6266_IG06:
       add      rsp, 40
       ret      
						;; size=5 bbWeight=0.50 PerfScore 0.62

; Total bytes of code 62, prolog size 4, PerfScore 19.70, instruction count 19, allocated bytes for code 62 (MethodHash=574ee785) for method MustHaveValueBenchmark.C:DoSomething(System.Span`1[ubyte],System.Span`1[ubyte])
; ============================================================

Looks fixed by RBO:

Dominator BB02 of BB03 has relop with same liberal VN
N006 (  6,  6) [000011] N---G--N-U-                           GT        int    <l:$240, c:$241>
N001 (  1,  1) [000025] -----------                         ├──▌  LCL_VAR   int    V12 tmp10        u:1 <l:$1c0, c:$200>
N005 (  4,  4) [000117] n---G------                         └──▌  IND       int    <l:$1c1, c:$201>
N004 (  2,  2) [000116] -------N---                            └──▌  ADD       byref  $181
N002 (  1,  1) [000029] -----------                               ├──▌  LCL_VAR   byref  V00 arg0         u:1 $80
N003 (  1,  1) [000115] -----------                               └──▌  CNS_INT   long   8 $140
 Redundant compare; current relop:
N003 (  5,  4) [000041] N------N-U-                           GT        int    <l:$240, c:$242>
N001 (  1,  1) [000033] -----------                         ├──▌  LCL_VAR   int    V12 tmp10        u:1 <l:$1c0, c:$200>
N002 (  3,  2) [000060] -----------                         └──▌  LCL_VAR   int    V14 tmp12        u:1 (last use) <l:$1c1, c:$202>
Fall through successor BB03 of BB02 reaches, relop [000041] must be false

Redundant branch opt in BB03:

removing useless STMT00010 ( INL03 @ ??? ... ??? ) <- INLRT @ 0x010[E-]
N004 (  7,  6) [000042] -----------                           JTRUE     void   $VN.Void
N003 (  5,  4) [000041] -----------                         └──▌  CNS_INT   int    0
 from BB03

Conditional folded at BB03
BB03 becomes a BBJ_NONE
Compiler::optRedundantBranch removed tree:
N004 (  7,  6) [000042] -----------                           JTRUE     void   $VN.Void
N003 (  5,  4) [000041] -----------                         └──▌  CNS_INT   int    0

@ghost ghost locked as resolved and limited conversation to collaborators Jan 13, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions JitUntriaged CLR JIT issues needing additional triage optimization tenet-performance Performance related issue
Projects
None yet
Development

No branches or pull requests

7 participants