Skip to content

Missing CSE caused by for loop on slice #119573

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
Kobzol opened this issue Jan 4, 2024 · 6 comments · Fixed by llvm/llvm-project#100102
Open

Missing CSE caused by for loop on slice #119573

Kobzol opened this issue Jan 4, 2024 · 6 comments · Fixed by llvm/llvm-project#100102
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-bug Category: This is a bug. E-needs-test Call for participation: An issue has been fixed and does not reproduce, but no test has been added. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@Kobzol
Copy link
Contributor

Kobzol commented Jan 4, 2024

While examining a weird performance regression on Zulip, we noticed a peculiar thing caused by a simple for loop.

Consider this code:

#[inline(never)]
fn add_index(bools: &[bool]) -> usize {
    let mut count: usize = 0;
    for i in 0..bools.len() {
        if bools[i] {
            count += 1;
        }
    }
    count
}

#[inline(never)]
fn add_iter(bools: &[bool]) -> usize {
    let mut count: usize = 0;
    for b in bools {
        if *b {
            count += 1;
        }
    }
    count
}

These two functions do the same thing, but the first one uses slice indexing, while the other one uses a more natural for loop. One would expect that the second version will optimize better (or same).

However, when these functions are called multiple times, only the first function is eligible for CSE (common subexpression elimination):

pub fn foo1(bools: &[bool]) {
    // CSE applied
    println!("a: {}", add_index(bools));
    println!("b: {}", add_index(bools));
}

pub fn foo2(bools: &[bool]) {
    // CSE not applied
    println!("a: {}", add_iter(bools));
    println!("b: {}", add_iter(bools));
}

Link to Godbolt.

When examining these two functions, I noticed that the indexed version uses nocapture for its argument, while the for loop version doesn't:

define noundef i64 @add_index(ptr noalias nocapture noundef nonnull readonly align 1 %bools.0, i64 noundef %bools.1) unnamed_addr #0 !dbg !7 {

define noundef i64 @add_iter(ptr noalias noundef nonnull readonly align 1 %bools.0, i64 noundef %bools.1) unnamed_addr #1 !dbg !57 {

which might be the reason of why CSE isn't applied.

It seems weird that the canonical way of iterating over a slice produces worse code (or at least function attributes) than manual indexing.

@Kobzol Kobzol added the C-bug Category: This is a bug. label Jan 4, 2024
@rustbot rustbot added the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Jan 4, 2024
@the8472
Copy link
Member

the8472 commented Jan 4, 2024

Maybe this is due to #111603 (comment) and llvm/llvm-project#74228 ?

@krtab
Copy link
Contributor

krtab commented Jan 4, 2024

Hi!

I've investigated a bit and I am sharing what I gathered so far.

Disclaimer: I have mostly investigated he fact that the iterator based version is not marked nocapture, but I don't know whether it is the cause for the missed CSE opportunity.

The main difference between the index based version and the iterator one is that the iterator ones increment pointers directly. It is basically equivalent to the following C code (which LLVM doesn't manage to mark as nocapture either):

size_t __attribute__ ((noinline)) add_iter(const bool *begin, const bool *end) {
    size_t count = 0;
    for (const bool *ptr = begin; ptr < end; ++ptr) {
        if (*ptr) {
            count += 1;
        }
    }
    return count;
}

The nocapture attribute is given to add_index by the LLVM pass PostOrderFunctionAttrsPass. However this pass doesn't manage to give it to add_iter. This may be because bools.0 appears in the phi nodes in add_iter:

add_index

  %count.07 = phi i64 [ %spec.select, %bb5 ], [ 0, %start ]
  %iter.sroa.0.06 = phi i64 [ %_0.i, %bb5 ], [ 0, %start ]

add_iter

  %count.05 = phi i64 [ %spec.select, %bb3 ], [ 0, %start ]
  %iter.sroa.0.04 = phi ptr [ %_30.i, %bb3 ], [ %bools.0, %start ]

@saethlin saethlin added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. and removed needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. labels Jan 4, 2024
@krtab
Copy link
Contributor

krtab commented Jan 5, 2024

I have investigated this further.

Disclaimer: I have mostly investigated he fact that the iterator based version is not marked nocapture, but I don't know whether it is the cause for the missed CSE opportunity.

So it turns out that the CSE happens before functions are marked nocapture. Both could share a common cause, but the missed capturing analysis is not the cause of the missed CSE.

I have not figured why the CSE does not happen. However, I came up with a slightly different MRE that includes the CSE-able function being memchr.

Gobolt

#[inline(never)]
fn has_zero_index(xs: &[u8]) -> bool {
    for i in 0..xs.len() {
        if xs[i] == 0 {
            return true
        }
    }
    false
}

#[inline(never)]
pub fn has_zero_memchr(xs: &[u8]) -> bool {
    xs.contains(&0)
}

#[inline(never)]
pub fn has_zero_iter(xs: &[u8]) -> bool {
    xs.iter().any(|&x| x == 0)
}

#[inline(never)]
fn has_zero_ptr(xs: &[u8]) -> bool {
    let range = xs.as_ptr_range();
    let mut start = range.start;
    let end = range.end;
    while start < end {
        unsafe {
            if *start == 0 {
                return true
            }
            start = start.add(1);
        }
    }
    false
}

pub fn foo_index(xs: &[u8])  {
    // CSE
    println!("a: {}", has_zero_index(xs));
    println!("b: {}", has_zero_index(xs));
}

pub fn foo_memchr(xs: &[u8]) {
    // No CSE
    println!("a: {}", has_zero_memchr(xs));
    println!("b: {}", has_zero_memchr(xs));
}

pub fn foo_iter(xs: &[u8]) {
    // No CSE
    println!("a: {}", has_zero_iter(xs));
    println!("b: {}", has_zero_iter(xs));
}

pub fn foo_ptr(xs: &[u8]) {
    // No CSE
    println!("a: {}", has_zero_ptr(xs));
    println!("b: {}", has_zero_ptr(xs));
}

@dianqk
Copy link
Member

dianqk commented Jan 8, 2024

So it turns out that the CSE happens before functions are marked nocapture. Both could share a common cause, but the missed capturing analysis is not the cause of the missed CSE.

I have not figured why the CSE does not happen. However, I came up with a slightly different MRE that includes the CSE-able function being memchr.

The reason should be the lack of nounwind and memory(argmem: read).

@dianqk
Copy link
Member

dianqk commented Jan 8, 2024

Maybe this is due to #111603 (comment) and llvm/llvm-project#74228 ?

It's a similar issue. Based on llvm/llvm-project#74228, the following changes can solve this issue.

--- a/llvm/lib/Transforms/IPO/FunctionAttrs.cpp
+++ b/llvm/lib/Transforms/IPO/FunctionAttrs.cpp
@@ -118,7 +118,7 @@ static void addLocAccess(MemoryEffects &ME, const MemoryLocation &Loc,
   if (isNoModRef(MR))
     return;

-  const Value *UO = getUnderlyingObject(Loc.Ptr);
+  const Value *UO = getUnderlyingObjectLookThrough(Loc.Ptr);
   assert(!isa<AllocaInst>(UO) &&
          "Should have been handled by getModRefInfoMask()");
   if (isa<Argument>(UO)) {

Maybe @caojoshua will submit a new PR after this one. cc @caojoshua.

@dianqk
Copy link
Member

dianqk commented Jul 24, 2024

Fixed by llvm/llvm-project#100102 at LLVM 20.
@rustbot label +llvm-fixed-upstream

@rustbot rustbot added the llvm-fixed-upstream Issue expected to be fixed by the next major LLVM upgrade, or backported fixes label Jul 25, 2024
yuxuanchen1997 pushed a commit to llvm/llvm-project that referenced this issue Jul 25, 2024
…gressive` (#100102)

Summary:
Thanks to #99509, we can fix
rust-lang/rust#119573 too.

Test Plan: 

Reviewers: 

Subscribers: 

Tasks: 

Tags: 


Differential Revision: https://phabricator.intern.facebook.com/D60251024
@dianqk dianqk added E-needs-test Call for participation: An issue has been fixed and does not reproduce, but no test has been added. and removed llvm-fixed-upstream Issue expected to be fixed by the next major LLVM upgrade, or backported fixes labels Feb 21, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-bug Category: This is a bug. E-needs-test Call for participation: An issue has been fixed and does not reproduce, but no test has been added. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants