Skip to content

-fzero-call-used-regs should not trigger before tail-calls #129764

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
nelhage opened this issue Mar 4, 2025 · 8 comments
Open

-fzero-call-used-regs should not trigger before tail-calls #129764

nelhage opened this issue Mar 4, 2025 · 8 comments

Comments

@nelhage
Copy link

nelhage commented Mar 4, 2025

I believe that -fzero-call-used-regs should be modified to not clear registers prior to a tail call. Here's my reasoning:

With the landing of clang::musttail, there's been a bit of a trend towards using indirect tail calls to implement efficient interpreters and parsers; see the original post about protobuf, and CPython's recent new interpreter. This pattern is, in part, an alternative to using computed gotos to implement dispatch within a single large interpreter function.

In both cases (computed gotos, and indirect tail calls), the opcode/parser definition generates fairly similar code, ending with an indirect call through a dispatch table. Depending on compiler choices, this turns into (on x86) something like jmpq *%REG or jmpq *(%REG1, %REG2, 8)

With -fzero-call-used-regs enabled, clang/LLVM currently emit call-used-clearing xors prior to the indirect tail-call, but not prior to a computed goto, even one that produces near-identical machine code (example on goldbolt, showing the stylized core of an interpreter loop).

Such interpreter loops tend to be extreme hot spots. On CPython, I've measured the cost of -fzero-call-used-regs=used-gpr on only the opcode functions at about 2% on the pyperformance suite, when using the tail-call interpreter. It seems surprising and "unfair" to impose this cost on the tail-call style but not the computed goto style of interpreter, when, again, they emit very similar machine code containing similar indirect jumps (and potential JOP gadgets).

Also, GCC's implementation behaves in the way I describe, eliding the clearing for tail calls. See a godbolt example -- if you remove the clang::musttail and add -fno-optimize-sibling-calls to the GCC options, the xors will reappear

@nelhage
Copy link
Author

nelhage commented Mar 4, 2025

cc @Fidget-Spinner for visibility on the CPython side.

I measured the "2% on pyperformance" number by comparing -fzero-call-used-regs=used-gpr with and without this patch:

diff --git i/Python/ceval_macros.h w/Python/ceval_macros.h
index 1bef2b845d0..cddad845fea 100644
--- i/Python/ceval_macros.h
+++ w/Python/ceval_macros.h
@@ -77,9 +77,10 @@
     // Note: [[clang::musttail]] works for GCC 15, but not __attribute__((musttail)) at the moment.
 #   define Py_MUSTTAIL [[clang::musttail]]
 #   define Py_PRESERVE_NONE_CC __attribute__((preserve_none))
+#   define Py_SKIP_ZERO_USED_REGS __attribute__((zero_call_used_regs("skip")))
     Py_PRESERVE_NONE_CC typedef PyObject* (*py_tail_call_funcptr)(TAIL_CALL_PARAMS);

-#   define TARGET(op) Py_PRESERVE_NONE_CC PyObject *_TAIL_CALL_##op(TAIL_CALL_PARAMS)
+#   define TARGET(op) Py_PRESERVE_NONE_CC Py_SKIP_ZERO_USED_REGS PyObject *_TAIL_CALL_##op(TAIL_CALL_PARAMS)
 #   define DISPATCH_GOTO() \
         do { \

Given that GCC behaves the way I propose I chose to open the issue instead of proposing that patch for Python, but I'm happy to open an issue over there if you think it's worth discussing over there.

@pinskia
Copy link

pinskia commented Mar 4, 2025

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101892 is requesting the opposite.

@Fidget-Spinner
Copy link

@nelhage Thanks for the investigation! Could you please open an enhancement ticket against CPython as well? I'd be happy to review it.

@rnk
Copy link
Collaborator

rnk commented Mar 10, 2025

Thanks for the report!

The zero_call_used_regs attribute is documented to have two purposes:

  1. improving security by reducing information leakage
  2. reducing the number of useful ROP gadgets in the program.

Indirect tail calls are jump gadgets, not return gadgets, technically, but it seems reasonable to consider if this would increase JOP gadget availability. All things considered, I think the value of goal 1 is higher than goal 2, so if it doesn't impact information leakage, we can probably do it. There are a wide variety of more effective control flow integrity technologies (endbr, PAC, all the others) one can enable that are more effective than trying to reduce gadget availability if that's what the user cares about.

The fact that we have an opt out attribute makes me hesitant to implement this behavior and potentially leak information through scratch registers. I don't have a lot of first-hand experience with these information leak stopping flags, they feel a little bit like security theater, since you can take a signal at any point and write out the entire register context into memory at any instruction boundary. I'm really not clear on the threat model.

Perhaps more practically, one way this could go wrong is, sometimes LLVM decides to use push to adjust the stack in the prologue when optimizing for size. I could easily imagine that the typical unused scratch registers (RCX, RAX) would immediately be spilled to memory, and could then be read later.

I see that @bwendling added the flag , maybe he and @nickdesaulniers have opinions. The review might have some more notes on the design goals.

Another question, how did this flag end up in the CPython build? GitHub cpython codesearch suggests it's not explicitly mentioned there. Has -fzero-call-used-regs just been added to the pantheon of security flags that ~all distros use for all packages, like Fortify, and the rest?

@thesamesam
Copy link
Member

No, it's not a standard distro hardening flag. I've heard nixpkgs may be doing it in one of the posts about this regression but I've not actually found where/if they are, or if it's opt-in per-package instead.

@nelhage
Copy link
Author

nelhage commented Mar 11, 2025

Another question, how did this flag end up in the CPython build?

I ran into this because I've been testing on nix, which default-enables it for the nixpkgs compilers. I'm not aware of it being default-on in other environments, but the fact that one distro did it makes me assume there's some chance someone else will.

@nelhage
Copy link
Author

nelhage commented Mar 11, 2025

Docs link for nix stdenv: https://github.com/NixOS/nixpkgs/blob/master/doc/stdenv/stdenv.chapter.md#zerocallusedregs-zerocallusedregs
Code is in this script, which is sourced from their compiler wrapper script: https://github.com/NixOS/nixpkgs/blob/master/pkgs/build-support/cc-wrapper/add-hardening.sh#L129

@thesamesam
Copy link
Member

Thanks, I'd missed that it's an enabled-by-default option in there (I'd found the file but hadn't read it right).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants