-fzero-call-used-regs should not trigger before tail-calls #129764
Comments
cc @Fidget-Spinner for visibility on the CPython side. I measured the "2% on
diff --git i/Python/ceval_macros.h w/Python/ceval_macros.h
index 1bef2b845d0..cddad845fea 100644
--- i/Python/ceval_macros.h
+++ w/Python/ceval_macros.h
@@ -77,9 +77,10 @@
// Note: [[clang::musttail]] works for GCC 15, but not __attribute__((musttail)) at the moment.
# define Py_MUSTTAIL [[clang::musttail]]
# define Py_PRESERVE_NONE_CC __attribute__((preserve_none))
+# define Py_SKIP_ZERO_USED_REGS __attribute__((zero_call_used_regs("skip")))
Py_PRESERVE_NONE_CC typedef PyObject* (*py_tail_call_funcptr)(TAIL_CALL_PARAMS);
-# define TARGET(op) Py_PRESERVE_NONE_CC PyObject *_TAIL_CALL_##op(TAIL_CALL_PARAMS)
+# define TARGET(op) Py_PRESERVE_NONE_CC Py_SKIP_ZERO_USED_REGS PyObject *_TAIL_CALL_##op(TAIL_CALL_PARAMS)
# define DISPATCH_GOTO() \
do { \
Given that GCC behaves the way I propose, I chose to open the issue here instead of proposing that patch for Python, but I'm happy to open an issue over there if you think it's worth discussing.
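For readers unfamiliar with the opt-out used in the patch above, here is a minimal sketch of how a per-function zero_call_used_regs("skip") attribute can be attached. The macro name SKETCH_SKIP_ZERO_REGS and the functions hot_handler and cold_helper are hypothetical and exist only for illustration; the __has_attribute guard is an assumption added so the snippet still compiles on compilers that lack the attribute.

```c
/* Hypothetical macro name; expands to nothing where the attribute is unavailable. */
#if defined(__has_attribute)
#  if __has_attribute(zero_call_used_regs)
#    define SKETCH_SKIP_ZERO_REGS __attribute__((zero_call_used_regs("skip")))
#  endif
#endif
#ifndef SKETCH_SKIP_ZERO_REGS
#  define SKETCH_SKIP_ZERO_REGS
#endif

/* hot_handler is a made-up stand-in for an opcode function: the attribute
 * asks the compiler not to zero call-used registers on exit from it,
 * regardless of the -fzero-call-used-regs= mode given on the command line. */
static SKETCH_SKIP_ZERO_REGS int hot_handler(int acc) { return acc + 1; }

/* cold_helper carries no attribute, so it follows whatever mode the build selects. */
static int cold_helper(int acc) { return acc * 2; }
```

The attribute changes only the generated epilogue, not the function's result, so the effect is visible in the disassembly rather than in program output.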
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101892 is requesting the opposite.
@nelhage Thanks for the investigation! Could you please open an enhancement ticket against CPython as well? I'd be happy to review it.
Thanks for the report! The zero_call_used_regs attribute is documented to have two purposes:
Indirect tail calls are jump gadgets, not return gadgets, technically, but it seems reasonable to consider if this would increase JOP gadget availability. All things considered, I think the value of goal 1 is higher than goal 2, so if it doesn't impact information leakage, we can probably do it. There are a wide variety of more effective control flow integrity technologies (
The fact that we have an opt-out attribute makes me hesitant to implement this behavior and potentially leak information through scratch registers. I don't have a lot of first-hand experience with these information leak stopping flags; they feel a little bit like security theater, since you can take a signal at any point and write out the entire register context into memory at any instruction boundary. I'm really not clear on the threat model.
Perhaps more practically, one way this could go wrong is, sometimes LLVM decides to use
I see that @bwendling added the flag, maybe he and @nickdesaulniers have opinions. The review might have some more notes on the design goals.
Another question, how did this flag end up in the CPython build? GitHub cpython codesearch suggests it's not explicitly mentioned there. Has
No, it's not a standard distro hardening flag. I've heard nixpkgs may be doing it in one of the posts about this regression but I've not actually found where/if they are, or if it's opt-in per-package instead.
I ran into this because I've been testing on nix, which default-enables it for the nixpkgs compilers. I'm not aware of it being default-on in other environments, but the fact that one distro did it makes me assume there's some chance someone else will.
Docs link for nix stdenv: https://github.com/NixOS/nixpkgs/blob/master/doc/stdenv/stdenv.chapter.md#zerocallusedregs-zerocallusedregs
Thanks, I'd missed that it's an enabled-by-default option in there (I'd found the file but hadn't read it right).
I believe that -fzero-call-used-regs should be modified to not clear registers prior to a tail call. Here's my reasoning:

With the landing of clang::musttail, there's been a bit of a trend towards using indirect tail calls to implement efficient interpreters and parsers; see the original post about protobuf, and CPython's recent new interpreter. This pattern is, in part, an alternative to using computed gotos to implement dispatch within a single large interpreter function.

In both cases (computed gotos and indirect tail calls), the opcode/parser definition generates fairly similar code, ending with an indirect call through a dispatch table. Depending on compiler choices, this turns into (on x86) something like jmpq *%REG or jmpq *(%REG1, %REG2, 8).

With -fzero-call-used-regs enabled, clang/LLVM currently emit call-used-clearing xors prior to the indirect tail-call, but not prior to a computed goto, even one that produces near-identical machine code (example on godbolt, showing the stylized core of an interpreter loop).

Such interpreter loops tend to be extreme hot spots. On CPython, I've measured the cost of -fzero-call-used-regs=used-gpr on only the opcode functions at about 2% on the pyperformance suite, when using the tail-call interpreter. It seems surprising and "unfair" to impose this cost on the tail-call style but not the computed goto style of interpreter, when, again, they emit very similar machine code containing similar indirect jumps (and potential JOP gadgets).

Also, GCC's implementation behaves in the way I describe, eliding the clearing for tail calls. See a godbolt example: if you remove the clang::musttail and add -fno-optimize-sibling-calls to the GCC options, the xors will reappear.
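To make the comparison concrete, here is a minimal sketch of the two dispatch styles on a toy three-opcode bytecode. It is not the godbolt example linked above and not CPython's code: the opcodes, the names run_goto, run_tail, op_inc, op_dec, op_halt, and the SKETCH_MUSTTAIL macro are all made up for illustration. It assumes a GNU C compiler (computed gotos are an extension) and falls back to a plain return where the musttail attribute is unavailable.

```c
/* Hypothetical macro: use musttail where the compiler offers it. */
#if defined(__has_attribute)
#  if __has_attribute(musttail)
#    define SKETCH_MUSTTAIL __attribute__((musttail))
#  endif
#endif
#ifndef SKETCH_MUSTTAIL
#  define SKETCH_MUSTTAIL /* plain return; the optimizer may still emit a tail call */
#endif

enum { OP_INC, OP_DEC, OP_HALT };

/* Style 1: computed gotos inside one large interpreter function. */
static int run_goto(const unsigned char *pc) {
    static void *const table[] = { &&do_inc, &&do_dec, &&do_halt };
    int acc = 0;
    goto *table[*pc];
do_inc:
    acc++;
    pc++;
    goto *table[*pc];      /* indirect jump through the dispatch table */
do_dec:
    acc--;
    pc++;
    goto *table[*pc];
do_halt:
    return acc;
}

/* Style 2: one function per opcode, chained by indirect tail calls. */
typedef int (*handler_fn)(const unsigned char *pc, int acc);
static int op_inc(const unsigned char *pc, int acc);
static int op_dec(const unsigned char *pc, int acc);
static int op_halt(const unsigned char *pc, int acc);
static const handler_fn handlers[] = { op_inc, op_dec, op_halt };

static int op_inc(const unsigned char *pc, int acc) {
    pc++;
    /* Same table lookup as above, expressed as a tail call. */
    SKETCH_MUSTTAIL return handlers[*pc](pc, acc + 1);
}

static int op_dec(const unsigned char *pc, int acc) {
    pc++;
    SKETCH_MUSTTAIL return handlers[*pc](pc, acc - 1);
}

static int op_halt(const unsigned char *pc, int acc) {
    (void)pc;              /* nothing left to dispatch */
    return acc;
}

static int run_tail(const unsigned char *pc) {
    return handlers[*pc](pc, 0);
}
```

Both run_goto and run_tail compute the same result for a given program. Building this with optimization and -fzero-call-used-regs=used-gpr and comparing the assembly of the two styles is one way to check whether a given compiler version inserts the clearing xors before the tail-call dispatch.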