Skip to content

[LLD][AARCH64] Bug with ADRP + LDR transformation when there is a branch to the LDR #138254

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
jeanPerier opened this issue May 2, 2025 · 9 comments

Comments

@jeanPerier
Copy link
Contributor

jeanPerier commented May 2, 2025

Summary:

The issue is how LLD replaces two ADRP + LDR pairs to the same symbol inconsistently while there is a branch somewhere in the program to one of the LDR that assumes both ADRP will end-up with the same offset in the program.

That following assembly:

        adrp    x2, :got:b
        // unrelated instructions
        ldr     x3, [x2, :got_lo12:b]  // note that source register !=dest

       // ... some code and check
       bge .L9
       // ... some more code
        adrp    x2, :got:b
.L9:
        ldr     x2, [x2, :got_lo12:b]

Is rewritten in the final executable by LLD into the following:

        adrp	x2, 0x220000
        // ...
        ldr	x3, [x2, #2944]

       // ... 
       bge .L9
       // ... 
        adrp	x2, 0x419000 <a+1999920>
.L9:
        add	x2, x2, #0x50

But now the path that jumps directly to L9 with the value from the first adrp will execute adrp x2, 0x220000 followed by add x2, x2, #0x50 which ends computing the wrong address.

Note that the hardcoded addresses in this illustration will of course be different in different environments, what matters is that the two adrp now have different offsets ( 0x220000 vs 0x419000).

The actual assembly were I discovered the issue is much more complex and was actually obtained when compiling LLVM source with gcc 9.3.1. More details below in the "Where I saw the bug" section below.

I think LLD should not optimize ADRP + LDR if there is a branch in the program to the LDR. I am not expert enough in aarch64 assembly to asses the validity of GCC generating a branch to the second LDR.

This was observed with LLD as old as LLD 17 and is still present in the LLVM head I tested (099a0fa) .

Small full repoducer:

Reproduce with the attached repro.tar.gz:

tar -xzvf repro.tar.gz
cd repro
bash build_and_run.sh

This will print the following output (the exact 4296785 number does not matter, this is an address, what matters is that it should be printed twice, while here 2228305 is printed):

expect 4296785 to be printed twice below:
4296785
2228305

Dump of assembler code for function foo:
   0x0000000000210880 <+0>:	cmp	x0, #0x0
   0x0000000000210884 <+4>:	b.ge	0x210898 <foo+24>  // b.tcont
   0x0000000000210888 <+8>:	adrp	x2, 0x419000 <a+1999920>
   0x000000000021088c <+12>:	add	x2, x2, #0x50
   0x0000000000210890 <+16>:	add	x0, x2, x1
   0x0000000000210894 <+20>:	b	0x210804 <bar>
   0x0000000000210898 <+24>:	stp	x29, x30, [sp, #-32]!
   0x000000000021089c <+28>:	adrp	x2, 0x220000
   0x00000000002108a0 <+32>:	add	x0, x0, #0x1
   0x00000000002108a4 <+36>:	ldr	x3, [x2, #2944]
   0x00000000002108a8 <+40>:	add	x0, x3, x0
   0x00000000002108ac <+44>:	mov	x29, sp
   0x00000000002108b0 <+48>:	stp	x2, x1, [sp, #16]
   0x00000000002108b4 <+52>:	bl	0x210804 <bar>
   0x00000000002108b8 <+56>:	ldp	x2, x1, [sp, #16]
   0x00000000002108bc <+60>:	ldp	x29, x30, [sp], #32
   0x00000000002108c0 <+64>:	b	0x21088c <foo+12>

You can see the bogus assembly from the executable for foo. For the record, foo from foo.s is manually modified assembly to get the branch on the second LDR for the following C code:

void foo(int64_t i, int64_t j) {
  if (i >=0)
    bar(i+1+(int64_t)&b);
  bar(j+(int64_t)&b);
}

Where I saw the bug:

This bug causes some LLVM release with assert builds with gcc (at least 9.3.0) + lld to produce a flang that will hit a weird compiler error on some Fortran programs when generation IR for WHERE statements. I am providing some details below for the record, but I do not think it matters here that I provide a full Fortran reproducer that will trigger this bug to cause an ICE in flang.

The bug at hand is breaking the generated code for llvm::DenseMap insertion in flang:

In repro.tar.gz attached above, you will find the assembly code before and after linking the excutable for the instantiation of the template function void mlir::IRMapping::map<llvm::MutableArrayRef<mlir::BlockArgument>, mlir::ValueRange&, (void*)0>(llvm::MutableArrayRef<mlir::BlockArgument>&&, mlir::ValueRange&) (a.k.a _ZN4mlir9IRMapping3mapIN4llvm15MutableArrayRefINS_13BlockArgumentEEERNS_10ValueRangeELPv0EEEvOT_OT0_) that is declared here in MLIR source and instantiated here in flang source.

Both ADRP + LDR are generated to compute the DenseMap seed using the llvm::install_fatal_error_handler function address here.

The bug is then triggered when this exact template instantiation is entered and the map needs to be grown during the insertion (this causes the code path to make the critical jump for this bug). In that case, the value is wrongly inserted into the wrong bucket because the hash used the bogus address. Later searches fail to retrieve the value map to that key because they are using the right hash seed.

The flang bug is very hard to reproduce because one needs to get gcc to do the right inlining and block merging + CSE + register assignment to create that branch on the second LDR. The distance in the text between the template code and llvm::install_fatal_error_handler also needs to be big enough so that LLD keeps ADRP.

@llvmbot
Copy link
Member

llvmbot commented May 2, 2025

@llvm/issue-subscribers-backend-aarch64

Author: None (jeanPerier)

**Summary:**

The issue is how LLD replaces two ADRP + LDR pairs to the same symbol inconsistently while there is a branch somewhere in the program to one of the LDR that assumes both ADRP will end-up with the same offset in the program.

That following assembly:

        adrp    x2, :got:b
        // unrelated instructions
        ldr     x3, [x2, :got_lo12:b]  // note that source register !=dest

       // ... some code and check
       bge .L9
       // ... some more code
        adrp    x2, :got:b
.L9:
        ldr     x2, [x2, :got_lo12:b]

Is rewritten in the final executable by LLD into the following:

        adrp	x2, 0x220000
        // ...
        ldr	x3, [x2, #<!-- -->2944]

       // ... 
       bge .L9
       // ... 
        adrp	x2, 0x419000 &lt;a+1999920&gt;
.L9:
        add	x2, x2, #<!-- -->0x50

But now the path that jumps directly to L9 with the value from the first adrp will execute adrp x2, 0x220000 followed by add x2, x2, #<!-- -->0x50 which ends computing the wrong address.

Note that the hardcoded addresses in this illustration will of course be different in different environments, what matters is that the two adrp now have different offsets ( 0x220000 vs 0x419000).

The actual assembly were I discovered the issue is much more complex and was actually obtained when compiling LLVM source with gcc 9.3.1. More details below in the "Where I saw the bug" section below.

I think LLD should not optimize ADRP + LDR if there is a branch in the program to the LDR. I am not expert enough in aarch64 assembly to asses the validity of GCC generating a branch to the second LDR.

This was observed with LLD as old as LLD 17 and is still present in the LLVM head I tested (099a0fa) .

Small full repoducer:

Reproduce with the attached repro.tar.gz:

tar -xzvf repro.tar.gz
cd repro
bash build_and_run.sh

This will print the following output (the exact 4296785 number does not matter, this is an address, what matters is that it should be printed twice, while here 2228305 is printed):

expect 4296785 to be printed twice below:
4296785
2228305

Dump of assembler code for function foo:
   0x0000000000210880 &lt;+0&gt;:	cmp	x0, #<!-- -->0x0
   0x0000000000210884 &lt;+4&gt;:	b.ge	0x210898 &lt;foo+24&gt;  // b.tcont
   0x0000000000210888 &lt;+8&gt;:	adrp	x2, 0x419000 &lt;a+1999920&gt;
   0x000000000021088c &lt;+12&gt;:	add	x2, x2, #<!-- -->0x50
   0x0000000000210890 &lt;+16&gt;:	add	x0, x2, x1
   0x0000000000210894 &lt;+20&gt;:	b	0x210804 &lt;bar&gt;
   0x0000000000210898 &lt;+24&gt;:	stp	x29, x30, [sp, #-32]!
   0x000000000021089c &lt;+28&gt;:	adrp	x2, 0x220000
   0x00000000002108a0 &lt;+32&gt;:	add	x0, x0, #<!-- -->0x1
   0x00000000002108a4 &lt;+36&gt;:	ldr	x3, [x2, #<!-- -->2944]
   0x00000000002108a8 &lt;+40&gt;:	add	x0, x3, x0
   0x00000000002108ac &lt;+44&gt;:	mov	x29, sp
   0x00000000002108b0 &lt;+48&gt;:	stp	x2, x1, [sp, #<!-- -->16]
   0x00000000002108b4 &lt;+52&gt;:	bl	0x210804 &lt;bar&gt;
   0x00000000002108b8 &lt;+56&gt;:	ldp	x2, x1, [sp, #<!-- -->16]
   0x00000000002108bc &lt;+60&gt;:	ldp	x29, x30, [sp], #<!-- -->32
   0x00000000002108c0 &lt;+64&gt;:	b	0x21088c &lt;foo+12&gt;

You can see the bogus assembly from the executable for foo. For the record, foo from foo.s is manually modified assembly to get the branch on the second LDR for the following C code:

void foo(int64_t i, int64_t j) {
  if (i &gt;=0)
    bar(i+1+(int64_t)&amp;b);
  bar(j+(int64_t)&amp;b);
}

Where I saw the bug:

This bug causes some LLVM release with assert builds with gcc (at least 9.3.0) + lld to produce a flang that will hit a weird compiler error on some Fortran programs when generation IR for WHERE statements. I am providing some details below for the record, but I do not think it matters here that I provide a full Fortran reproducer that will trigger this bug to cause an ICE in flang.

The bug at hand is breaking the generated code for llvm::DenseMap insertion in flang:

In repro.tar.gz attached above, you will find the assembly code before and after linking the excutable for the instantiation of the template function void mlir::IRMapping::map&lt;llvm::MutableArrayRef&lt;mlir::BlockArgument&gt;, mlir::ValueRange&amp;, (void*)0&gt;(llvm::MutableArrayRef&lt;mlir::BlockArgument&gt;&amp;&amp;, mlir::ValueRange&amp;) (a.k.a _ZN4mlir9IRMapping3mapIN4llvm15MutableArrayRefINS_13BlockArgumentEEERNS_10ValueRangeELPv0EEEvOT_OT0_) that is declared here in MLIR source and instantiated here in flang source.

Both ADRP + LDR are generated to compute the DenseMap seed using the llvm::install_fatal_error_handler function address here.

The bug is then triggered when this exact template instantiation is entered and the map needs to be grown during the insertion (this causes the code path to make the critical jump for this bug). In that case, the value is wrongly inserted into the wrong bucket because the hash used the bogus address. Later searches fail to retrieve the value map to that key because they are using the right hash seed.

The flang bug is very hard to reproduce because one needs to get gcc to do the right inlining and block merging + CSE + register assignment to create that branch on the second LDR. The distance in the text between the template code and llvm::install_fatal_error_handler also needs to be big enough so that LLD keeps ADRP.

@davemgreen
Copy link
Collaborator

I think @smithp35 might have been looking at a similar issue recently with llvm outlining the ldr away from the adrp.

@smithp35
Copy link
Collaborator

smithp35 commented May 2, 2025

I think @smithp35 might have been looking at a similar issue recently with llvm outlining the ldr away from the adrp.

Yes, the issue is #129122 that came from the outliner producing GOT generating ADRP and ADD relocations in two separate sections separated by a branch. The specific "optimisation" came from identical code folding, and was dependent on one of the sections being folded, but the other one not.

I'll have to take a look at the reproducer, as this should have at least been partially mitigated by 099a0fa

There were some follow up PRs:
#131630 which was abandoned in favour of #134342 which was recently merged.

Will be worth giving a recent build a try as it looks like the same issue a GOT relative ADRP/ADD there is one outstanding issue, which is GOT relative relocations to local symbols (which are a lot rarer) #136641

There's a related issue #131660 on the outliner to tell it not to split ADRP (ADD/LDR) as while it is technically legal it is in general not a good thing to do for performance.

@jeanPerier
Copy link
Contributor Author

jeanPerier commented May 2, 2025

Thanks for the feedback, note that I do not think the outliner is involved (although I am not familiar with it).

The code from this bug that has a branch to an LDR was generated by GCC. It does not looks like LLD is doing any outlining on this. The issue seems to be purely at the ADRP + LDR replacement pattern level in LLD that likely do not detect the branch on the LDR when optimizing it to a slightly different ADRP + ADD.

This ADRP+LDR to ADRP + ADD pattern seems to be tested in this LLD test, I am not sure which part of LLD is responsible for this relaxation).

@smithp35
Copy link
Collaborator

smithp35 commented May 2, 2025

Yes I think this is a complete separate problem. This is confined to one section and doesn't involve ICF. I'll have a look to see if I can see what the problem is.

@smithp35
Copy link
Collaborator

smithp35 commented May 2, 2025

OK. My apologies for taking so long to respond, I got too focussed on whether lld was applying the transformation correctly and completely missed that the code was branching into the middle of an ADRP, LDR pair which would obviously be at risk of breaking if it had assumed it could reuse the result of the ADRP from the previous calculation.

I think this will most likely be made a compiler code-generation problem as LLD is meeting the conditions set out in the ABI for performing into the translation
https://github.com/ARM-software/abi-aa/blob/main/aaelf64/aaelf64.rst#579relocation-optimization

A static linker can't realistically detect if code has arbitrarily jumped (branches within a function don't have a relocation) into the middle of a linker relaxable sequence. We'll most likely have to write requirements for code-generators into the ABI to not permit this, or linker optimisation becomes impossible, which is most likely the wrong trade-off.

IIUC this is a problem found in a flang binary which has been generated by gcc 9.3.0 , but flang itself. I'll have a word with our GNU team to make sure they are aware.

It is possible to turn off lld optimisation of ADRP/LDR with --no-relax if you need a workaround.

@jeanPerier
Copy link
Contributor Author

Thanks for the feedback!

So is it safe to assume that clang/llvm aarch64 backend will never generate this kind of branch in the middle of an ADRP + LDR pattern?

IIUC this is a problem found in a flang binary which has been generated by gcc 9.3.0 , but flang itself. I'll have a word with our GNU team to make sure they are aware.

That is right.

I checked with several versions of gcc, and 9.3.0 seems to be the only one that produce this weird jump in the _ZN4mlir9IRMapping3mapIN4llvm15MutableArrayRefINS_13BlockArgumentEEERNS_10ValueRangeELPv0EEEvOT_OT0_ code of flang binary (GCC 10, 11 and 12 made different inlining decision and did not expose the two ADRP + LDR in a way they could be block merged, so I cannot really say they would not do it in other contexts. GCC 13, 14, and 15 took similar inlining decision but created a jump after the second LDR, and not on it).

It is hard to tell from my limited case in flang that involves a lot of factors if GCC>9.3.0 are safe or not (that is, if it now explicitly avoids jumping in the middle of ADRP+LDR, or if it just gets lucky on my example because it makes different inlining/register allocation decisions now). So it is definitely worth double checking with some AARCH GCC expert here, thanks!

@efriedma-quic
Copy link
Collaborator

So is it safe to assume that clang/llvm aarch64 backend will never generate this kind of branch in the middle of an ADRP + LDR pattern?

It's unlikely, but I don't think we have any explicit protection against it.

@smithp35
Copy link
Collaborator

smithp35 commented May 6, 2025

I had a quick check with the GCC team. More recent releases of GCC keep GOT loads of ADRP and LDR consecutive, so this kind of case will be unlikely. I didn't get a guarantee that there won't be branches into the middle of an ADRP/LDR sequence though, and we have an existence proof that it does happen in existing code.

Thinking about this a bit more and looking at LLD's code. While we can't realistically ask linker's to check if a branch destination is in the middle of a optimisable ADRP/LDR, we should be able to check whether all ADRP/LDR to a given symbol in the section are optimisable (adjacent ADRP/LDR, and in range). If all sequences are optimised or none of them are optimised we'll be protected from this problem. A more conservative implementation could just abort the optimisation if any ADRP/LDR is either out of range or non-adjacent.

LLD does have some code to check that all ADRP/LDR relocations are consecutive in the relocation table.

To summarise:

  • We should add a condition to the ABI that for a symbol S, there cannot be partial application of the optimisation within a section. All have to be optimised or none of them.
  • LLD can be fixed to protect against this case.

ABI issue raised ARM-software/abi-aa#328

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants