Skip to content

[ARM] Possible race condition in std::atomic_compare_exchange_strong #29793

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
rengolin opened this issue Sep 19, 2016 · 29 comments
Closed

[ARM] Possible race condition in std::atomic_compare_exchange_strong #29793

rengolin opened this issue Sep 19, 2016 · 29 comments
Labels
bugzilla Issues migrated from bugzilla clang Clang issues not falling into any other category

Comments

@rengolin
Copy link
Member

Bugzilla Link 30445
Resolution FIXED
Resolved on Dec 12, 2016 03:54
Version unspecified
OS Linux
Attachments Reproducing script and objects
CC @rovka,@mclow,@pirama-arumuga-nainar,@rengolin,@stephenhines

Extended Description

The tests for atomic_compare_exchange_strong may be exhibiting a race condition.

Source:
http://buildmaster.tcwglab.linaro.org/builders/libcxx-libcxxabi-libunwind-arm-linux

Environment:
ARMv7 Cortex-A15
Ubuntu Xenial 16.04.2
gcc version 5.4.0 20160609 (Ubuntu/Linaro 5.4.0-6ubuntu1~16.04.2)
GNU ld (GNU Binutils for Ubuntu) 2.26.1
ldd (Ubuntu GLIBC 2.23-0ubuntu3) 2.23

The same tests pass on our production bot:
ARMv7 Cortex-A15
Ubuntu Trusty 14.04
gcc version 4.8.2 (Ubuntu/Linaro 4.8.2-19ubuntu1)
GNU ld (GNU Binutils for Ubuntu) 2.24
ldd (Ubuntu EGLIBC 2.19-0ubuntu6.3) 2.19

I'm building in the exact same configuration (our master is a mirror). I have tested it on two different boards (Chromebooks and Jetson TK1s) with Xenial and the problem is consistent on both with Xenial and neither with Trusty.

When I run the tests directly, I get an assert like this:

atomic_compare_exchange_strong.pass.cpp:41: void TestFn::operator()() const [T = long long]: Assertion `a == T(2)' failed.

The code is:

{
    typedef std::atomic<T> A;
    A a;
    T t(T(1));
    std::atomic_init(&a, t);
    assert(std::atomic_compare_exchange_strong(&a, &t, T(2)) == true);
    assert(a == T(2));
    assert(t == T(1));
    assert(std::atomic_compare_exchange_strong(&a, &t, T(3)) == false);
    assert(a == T(2));
    assert(t == T(2));
}

The first exchange does the right thing, compare, they're equal, copies T(2) into a, then the asserts are correct.

The second exchange, however, produce false (as expected), but now a has the value of 3, which is completely unexpected. It shouldn't be writing onto a at all!

In GDB, when I run get a stack trace, it prints __a_ = 3, but when I then set a breakpoint on that line, it keeps repeating (as if looping over, may be the test harness), and always producing the correct, expected results.

For completeness, here's the stack trace:

Program received signal SIGABRT, Aborted.
__libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:47
47 ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S: No such file or directory.
(gdb) bt
#​0 __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:47
#​1 0xb6d76638 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#​2 0xb6d7733a in __GI_abort () at abort.c:89
#​3 0xb6d719b2 in __assert_fail_base (fmt=0x1 <error: Cannot access memory at address 0x1>, assertion=0x2a0d8 "a == T(2)", assertion@entry=0x2 <error: Cannot access memory at address 0x2>, file=file@entry=0xb6ff3340 "", line=41,
line@entry=3068375168, function=function@entry=0x2a30f "void TestFn::operator()() const [T = long long]") at assert.c:92
#​4 0xb6d71a4a in __GI___assert_fail (assertion=0x2 <error: Cannot access memory at address 0x2>, file=0xb6ff3340 "", line=3068375168, function=0x2a30f "void TestFn::operator()() const [T = long long]") at assert.c:101
#​5 0x000239f0 in TestFn::operator() (this=0xbefff1fc) at /home/linaro/devel/llvm/worktree/repos/libcxx/test/std/atomics/atomics.types.operations/atomics.types.operations.req/atomic_compare_exchange_strong.pass.cpp:41
#​6 0x00010700 in TestEachIntegralType::operator() (this=0xbefff240) at /home/linaro/devel/llvm/worktree/repos/libcxx/test/std/atomics/atomics.types.operations/atomics.types.operations.req/atomic_helpers.h:39
#​7 0x0001067c in TestEachAtomicType::operator() (this=0xbefff254) at /home/linaro/devel/llvm/worktree/repos/libcxx/test/std/atomics/atomics.types.operations/atomics.types.operations.req/atomic_helpers.h:60
#​8 0x00010654 in main () at /home/linaro/devel/llvm/worktree/repos/libcxx/test/std/atomics/atomics.types.operations/atomics.types.operations.req/atomic_compare_exchange_strong.pass.cpp:61
(gdb) f 5
#​5 0x000239f0 in TestFn::operator() (this=0xbefff1fc) at /home/linaro/devel/llvm/worktree/repos/libcxx/test/std/atomics/atomics.types.operations/atomics.types.operations.req/atomic_compare_exchange_strong.pass.cpp:41
41 assert(a == T(2));
(gdb) p a
$1 = {<std::__1::__atomic_base<long long, true>> = {<std::__1::__atomic_base<long long, false>> = {_a = 3, static is_always_lock_free = }, }, }

@llvmbot
Copy link
Member

llvmbot commented Sep 30, 2016

@​Renato Our header simply forwards to clang's __c11_atomic_foo builtins. The bug exhibited here is almost certainly in the builtins, not libc++.

I'm rebinning this under Clang. Feel free to move it back if you disagree.

@llvmbot
Copy link
Member

llvmbot commented Sep 30, 2016

I'll take a look at this.

@llvmbot
Copy link
Member

llvmbot commented Sep 30, 2016

@​Renato:

Would you mind testing this small reproducer for me please?

---------->8-----------
#include
#include <type_traits>
#include

void testme()
{
{
typedef std::atomic A;
A a;
long long t((long long)(1));
std::atomic_init(&a, t);
assert(std::atomic_compare_exchange_strong(&a, &t, (long long)(2)) == true);
assert(a == (long long)(2));
assert(t == (long long)(1));
assert(std::atomic_compare_exchange_strong(&a, &t, (long long)(3)) == false);
assert(a == (long long)(2));
assert(t == (long long)(2));
}
}

int main()
{
testme();
}
---------->8-----------

If it asserts (I expect it to), could you please send me over the binaries (image + object) as before? The original test is templated and produces a lot of output, this should make it easy for me to look at the codegen.

Also, what options are you using with this run? Your linked bot seems to be down. I presume the options are same as those on http://lab.llvm.org:8011/builders/libcxx-libcxxabi-libunwind-arm-linux?

I can still work on the original binaries somewhat, but the above reproducer should make it easier.

@rengolin
Copy link
Member Author

rengolin commented Oct 5, 2016

Hi Asiri,

Sorry for the delay, I was at Connect.

The reproducer does assert:

atomic_compare_exchange_strong.pass.cpp:16: void testme(): Assertion `a == (long long)(2)' failed.

Program received signal SIGABRT, Aborted.
__libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:47
47 ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S: No such file or directory.
(gdb) bt
#​0 __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:47
#​1 0xb6d76638 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#​2 0xb6d7733a in __GI_abort () at abort.c:89
#​3 0xb6d719b2 in __assert_fail_base (fmt=0x1 <error: Cannot access memory at address 0x1>,
assertion=0x118b2 "a == (long long)(2)", assertion@entry=0x2 <error: Cannot access memory at address 0x2>,
file=file@entry=0xb6ff3340 "", line=16, line@entry=3068375168, function=function@entry=0x118a4 "void testme()")
at assert.c:92
#​4 0xb6d71a4a in __GI___assert_fail (assertion=0x2 <error: Cannot access memory at address 0x2>,
file=0xb6ff3340 "", line=3068375168, function=0x118a4 "void testme()") at assert.c:101
#​5 0x00011790 in testme () at atomic_compare_exchange_strong.pass.cpp:16
#​6 0x000117e4 in main () at atomic_compare_exchange_strong.pass.cpp:23

attaching the source/objects.

@rengolin
Copy link
Member Author

rengolin commented Oct 5, 2016

New simpler reproducer

@rengolin
Copy link
Member Author

Asiri,

Just FYI, I fixed our bot, which is back exhibiting this bug:

http://buildmaster.tcwglab.linaro.org/builders/libcxx-libcxxabi-libunwind-arm-linux/builds/101

So we can monitor again if commits fix it before moving the bot to the upstream master.

cheers,
--renato

@llvmbot
Copy link
Member

llvmbot commented Oct 10, 2016

Hi Rento,

How urgent is this btw? I'm a bit stuck with some work. I can ask someone else to have a peak if this is quite urgent.

Cheers,

/ Asiri

@rengolin
Copy link
Member Author

Hi Asiri,

We're just waiting on this bug to move the new libcxx builder to production and de-phase the old Chromebook-based ones.

We still have other bugs to look at [1] and since you know libc++ better than we do, I'll let you fix this one. :)

The libc++ bot is not high traffic, so it should be fine keeping it downstream, since it's already tracking trunk and is publicly accessible (but not emailing anyone).

cheers,
--renato

[1] http://buildmaster.tcwglab.linaro.org/builders/clang-cmake-armv7-a15/builds/693

@rengolin
Copy link
Member Author

And, just like that, it stopped happening... :/

If it happens again I'll re-open this bug.

@llvmbot
Copy link
Member

llvmbot commented Nov 21, 2016

Sorry, I didn't find any time to look into this :(

Glad it went away on its own :P

@llvmbot
Copy link
Member

llvmbot commented Dec 2, 2016

We're seeing this on Android too. Definitely not an issue with libc++ because you can reduce the test case down to just the atomic intrinsics and it still fails:

#include

int main() {
_Atomic(long long) a;
long long t = 1LL;
long long l;

__c11_atomic_store(&a, 2LL, __ATOMIC_SEQ_CST);
l = __c11_atomic_load(&a, __ATOMIC_SEQ_CST);
assert(l == 2LL);

assert(__c11_atomic_compare_exchange_strong(&a, &t, 3LL, __ATOMIC_SEQ_CST,
                                            __ATOMIC_SEQ_CST) == false);
l = __c11_atomic_load(&a, __ATOMIC_SEQ_CST);
assert(l == 2LL); // Should still be 2 since the comparison failed, but is 3.
assert(t == 2LL);

}

Pirama just tried with a current Clang and said he's still seeing it. Not sure why Renato's failure disappeared.

Some other notes:

  • Doesn't affect arm64
  • Only affects long long

@rengolin
Copy link
Member Author

rengolin commented Dec 2, 2016

Hi Dan,

I'm using Xenial's default Clang: 3.8.0.

We have no way of fixing GCC 5.3, at least not in time to backport it to Ubuntu's updates, which will be in their own time.

We have found a number of problems with that version of GCC, so using Clang was the only solution. I do have a list of all the problems we found, if you're interested.

But regarding Clang, which version are you using? 3.9.0? Some trunk?

If Clang still fails and our 3.8 passes by luck, I very much want to know about it! :)

cheers,
--renato

@llvmbot
Copy link
Member

llvmbot commented Dec 2, 2016

Looks like we're currently on r275480. I think Pirama did his test with ToT.

@llvmbot
Copy link
Member

llvmbot commented Dec 2, 2016

Oh, the other important repro piece: it only shows up at -O0. That's probably why Renato isn't seeing it any more.

@rengolin
Copy link
Member Author

rengolin commented Dec 2, 2016

This is not good. I'm reopening this.

Could this be the result of Clang optimising (and eliding the problem) with the code? Or undefined behaviour? I don't know enough about the new standards (or atomics) to identify undefined behaviour in there.

cheers,
--renato

@llvmbot
Copy link
Member

llvmbot commented Dec 2, 2016

I don't have access to a board / model, I'm looking at the assembly generated for the following program at -O3:

int foo() {
_Atomic(long long) a;
long long t = 1LL;

__c11_atomic_store(&a, 2LL, __ATOMIC_SEQ_CST);

__c11_atomic_compare_exchange_strong(&a,
&t,
3LL,
__ATOMIC_SEQ_CST,
__ATOMIC_SEQ_CST);

return __c11_atomic_load(&a, __ATOMIC_SEQ_CST);
}

I can't see a problem. Will try trunk.

@llvmbot
Copy link
Member

llvmbot commented Dec 2, 2016

(Right, missed the comment, will check -O0 too)

@llvmbot
Copy link
Member

llvmbot commented Dec 3, 2016

I can't see an obvious error at -O0 for clang 3.8.0, but the assembly generated is quite messy, so I could be missing something. I will load this onto a model on Monday and look at the trace to make sure.

@​Renato: could you check if my code snippet, that is:

int foo() {
_Atomic(long long) a;
long long t = 1LL;

__c11_atomic_store(&a, 2LL, __ATOMIC_SEQ_CST);

__c11_atomic_compare_exchange_strong(&a,
&t,
3LL,
__ATOMIC_SEQ_CST,
__ATOMIC_SEQ_CST);

return __c11_atomic_load(&a, __ATOMIC_SEQ_CST);
}

Produces the wrong result (i.e. returns anything but 2) for you on 3.8.0 at -O0? Just to double check if my simplification has not removed the codegen fault.

Thanks.

@llvmbot
Copy link
Member

llvmbot commented Dec 3, 2016

Trunk codegen has changed a fair bit; both r11 and sp are used as base registers for accessing the stack, which makes tracing the assembly output quite hard (at -O0). I will compare the traces on Monday, that should make it easy to pick out the problem.

/ Asiri

@llvmbot
Copy link
Member

llvmbot commented Dec 5, 2016

Attempt reproduce
Tried this on a model but could not reproduce (trunk, -O0).

@​Dan, Renato: Can one of you please try the attached reproducer (t1.c) and confirm that you can still see the problem (i.e. prints 3 instead of 2)?

If so, could you please attach the disassembly that is failing? Just to make absolutely sure we're on the same page.

Thanks.

/ Asiri

@rengolin
Copy link
Member Author

rengolin commented Dec 5, 2016

Hi Asiri,

Clang 3.8.0 works well (ie, prints 2) with your example. I tried multiple changes to the command line (O0 ~ O3, -g, -std, etc) and they all work.

--renato

@pirama-arumuga-nainar
Copy link
Collaborator

@pirama-arumuga-nainar
Copy link
Collaborator

I tested Asiri's reproducer (t1.c) on ToT clang (with a slight change of directly returning the value from main instead of printing to stdout). I can confirm that it works correctly on ToT clang (and returns 2).

I am attaching the objdump of binaries produced by ToT clang and the one used by Android (which is based on r275480). I couldn't explain why the output from r275480 is broken, but hopefully Asiri or Renato can figure it out.

@llvmbot
Copy link
Member

llvmbot commented Dec 6, 2016

So the problem with r275480 is here:

480: e1b18f9f ldrexd r8, [r1]
484: e1580004 cmp r8, r4
488: e0d9c005 sbcs ip, r9, r5
48c: 1a000002 bne 49c <_Z3foov+0xac>
490: e1a1cf96 strexd ip, r6, [r1]

That cmp/sbcs combination does not work for comparing two 64-bit values. In this case, r8=2, r4=1 and r9=r5=0. So the cmp sets C=1 (not borrow) and sbcs performs ip = r9 - r5 - not(C), which evaluates to zero, so the branch is not taken and you get the strexd.

On ToT, this is done like so:

494: e1b38f9f ldrexd r8, [r3]
498: e1580004 cmp r8, r4
49c: 01590005 cmpeq r9, r5
4a0: 1a000002 bne 4b0 <_Z3foov+0xc0>
4a4: e1a3cf96 strexd ip, r6, [r3]

Which is more straightforward to understand (and correct!).

I could run the bisector and see which commit fixed the issue, but given it is fixed, perhaps we should all move on? ;-)

@pirama-arumuga-nainar
Copy link
Collaborator

r288433 with a pretty on-the-mark commit message. Found by searching 'O0' in 'git log', since I was pretty sure this wasn't fixed in my checkout from late November.

@rengolin
Copy link
Member Author

Pirama,

Have you tried reverting it to see if the behaviour goes away?

@llvmbot
Copy link
Member

llvmbot commented Dec 12, 2016

@​Renato: r288433 is the commit that fixes the problem...

I think this issue can now be closed? It would be pretty safe to back-port r288433 if android is stuck on an earlier revision.

@rengolin
Copy link
Member Author

Sorry, I thought you said that was the one that broke it. :)

It is safe to close the bug, thanks for the investigations.

@KernelAddress
Copy link
Mannequin

KernelAddress mannequin commented Nov 26, 2021

mentioned in issue llvm/llvm-bugzilla-archive#30675

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 10, 2021
This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bugzilla Issues migrated from bugzilla clang Clang issues not falling into any other category
Projects
None yet
Development

No branches or pull requests

3 participants