8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs #25889

Open
wants to merge 1 commit into
base: master

Conversation

vamsi-parasa
Contributor

@vamsi-parasa vamsi-parasa commented Jun 19, 2025

The goal of this PR is to replace the PUSH and POP instructions in the existing x86 assembly stubs with paired PUSHP/POPP instructions, which are part of Intel APX technology.

In Intel APX, the PUSHP and POPP instructions are modern, compact replacements for the legacy PUSH and POP, designed to work seamlessly with the expanded set of 32 general-purpose registers (R0–R31). Unlike their predecessors, they use the new APX (REX2-based) encoding, enabling more uniform and efficient instruction formats. These instructions improve code density, simplify register access, and are optimized for performance on APX-enabled CPUs.

Pairing PUSHP and POPP in Intel APX provides CPU-level benefits such as more efficient instruction decoding, better stack pointer tracking, and improved register dependency management. Their uniform encoding allows for streamlined execution, reduced pipeline stalls, and potential micro-op fusion, all of which enhance performance and power efficiency. This pairing helps the processor optimize speculative execution and register lifetimes, making code faster and more scalable on modern architectures.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8359965: Enable paired pushp and popp instruction usage for APX enabled CPUs (Sub-task - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/25889/head:pull/25889
$ git checkout pull/25889

Update a local copy of the PR:
$ git checkout pull/25889
$ git pull https://git.openjdk.org/jdk.git pull/25889/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 25889

View PR using the GUI difftool:
$ git pr show -t 25889

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/25889.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper bot commented Jun 19, 2025

👋 Welcome back sparasa! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Jun 19, 2025

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

@openjdk openjdk bot added the rfr Pull request is ready for review label Jun 19, 2025
@openjdk

openjdk bot commented Jun 19, 2025

@vamsi-parasa The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@mlbridge

mlbridge bot commented Jun 19, 2025

Webrevs

Member

@dholmes-ora dholmes-ora left a comment

Just a drive-by comment as this isn't code I normally have much to do with but to me it would look a lot cleaner to define push_paired/pop_paired (maybe abbreviating directly to pushp/popp?) rather than passing the boolean.

@vamsi-parasa
Contributor Author

Just a drive-by comment as this isn't code I normally have much to do with but to me it would look a lot cleaner to define push_paired/pop_paired (maybe abbreviating directly to pushp/popp?) rather than passing the boolean.

Hi David (@dholmes-ora),

Thanks for the suggestion!
We're open to changing the API as the community suggests. Users need to be aware that push_paired/pop_paired (or pushp/popp) will fall back to the legacy push/pop instructions if the CPU does not support APX features.

Thanks,
Vamsi
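
The fallback behaviour described above can be sketched outside HotSpot as follows. This is only an illustration under stated assumptions: `MockAssembler`, its `apx_supported` field (standing in for `VM_Version::supports_apx_f()`), the string register names, and `emit_frame` are all hypothetical and are not part of the real MacroAssembler API. The wrappers `paired_push`/`paired_pop` use one of the names floated in this thread.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the assembler; it records mnemonics instead of
// emitting machine code.
struct MockAssembler {
  bool apx_supported;                // stands in for VM_Version::supports_apx_f()
  std::vector<std::string> emitted;  // instruction stream recorded so far

  explicit MockAssembler(bool apx) : apx_supported(apx) {}

  // Raw emitters: always emit exactly the instruction that was requested.
  void push(const std::string& reg)  { emitted.push_back("push " + reg); }
  void pop(const std::string& reg)   { emitted.push_back("pop " + reg); }
  void pushp(const std::string& reg) { emitted.push_back("pushp " + reg); }
  void popp(const std::string& reg)  { emitted.push_back("popp " + reg); }

  // Wrappers: use the paired form when APX is available, else the legacy form.
  void paired_push(const std::string& reg) {
    if (apx_supported) pushp(reg); else push(reg);
  }
  void paired_pop(const std::string& reg) {
    if (apx_supported) popp(reg); else pop(reg);
  }
};

// Emits a tiny balanced prologue/epilogue and returns the recorded stream.
std::vector<std::string> emit_frame(bool apx) {
  MockAssembler masm(apx);
  masm.paired_push("r12");
  masm.paired_push("r13");
  masm.paired_pop("r13");
  masm.paired_pop("r12");
  return masm.emitted;
}
```

With this shape the call sites stay identical on both kinds of CPU, and the choice between the paired and legacy encodings is made in one place.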

@vpaprotsk
Contributor

vpaprotsk commented Jun 26, 2025

Like @dholmes-ora, I also prefer a new function (in MacroAssembler) instead of flags, though I like the names paired_push/paired_pop.

The shorter pushp/popp might also be acceptable (better readability), though I think I prefer the longer name: I am more likely to look up the longer function's definition to see what it does, whereas with the shorter one I might assume it is just the regular push/pop. That said, it could also fall under the category of 'you are supposed to know that'.

PS: sed -e "/is_pair/ s|pop(\([^,]*\), true /\*is_pair\*/)|paired_pop(\1)|" -e "/is_pair/ s|push(\([^,]*\), true /\*is_pair\*/)|paired_push(\1)|"

@@ -795,6 +795,22 @@ void MacroAssembler::pop_d(XMMRegister r) {
addptr(rsp, 2 * Interpreter::stackElementSize);
}

void MacroAssembler::push(Register src, bool is_pair) {
if (is_pair && VM_Version::supports_apx_f()) {
pushp(src);
Member

What does is_pair signify here? You are just pushing one register. Did you intend to use has_matching_pop?

}

void MacroAssembler::pop(Register dst, bool is_pair) {
if (is_pair && VM_Version::supports_apx_f()) {
Member

Same as above, new argument suggestion: please use has_matching_push.
I understand your purpose here is to delegate the responsibility of balancing of PPX pair to the user.

Member

For a cleaner interface, I think we can also maintain a RAII style APXPushPopPairTracker in the stub snippets that use push/pop instruction sequences, and wrap the actual assembler call underneath. The idea is to catch balancing errors up front, since PPX is purely a performance hint: instructions with this hint have the same functional semantics as those without. PPX hints set by the compiler that violate the balancing rule may turn off the PPX optimization, but they will not affect program semantics.

class APXPushPopPairTracker {
    private:
        int _counter;  // pushp instructions not yet matched by a popp

    public:
        APXPushPopPairTracker() : _counter(0) {
        }

        ~APXPushPopPairTracker() {
            assert(_counter == 0, "Push/pop pair mismatch");
        }

        void push(Register reg, bool has_matching_pop) {
            if (has_matching_pop && VM_Version::supports_apx_f()) {
                Assembler::pushp(reg);
                incrementCounter();
            } else {
                Assembler::push(reg);
            }
        }
        void pop(Register reg, bool has_matching_push) {
            if (has_matching_push && VM_Version::supports_apx_f()) {
                Assembler::popp(reg);
                decrementCounter();
            } else {
                Assembler::pop(reg);
            }
        }
        void incrementCounter() {
            _counter++;
        }
        void decrementCounter() {
            _counter--;
        }
};

Contributor

For a cleaner interface, I think we can also maintain a RAII style APXPushPopPairTracker ...

Using the suggested code as a base, Vamsi and I tinkered with the idea some more! Here is what we came up with. This also tracks the correct order of registers being pushed/popped (haven't compiled it, so it might have some syntax bugs).

@dholmes-ora would you mind sharing your opinion? We seem to be making things more complicated, but hopefully in a good way?

Also included a sample usage in a stub.

#define __ _masm->

class PushPopTracker {
   private:
      int _counter;                // number of registers currently tracked
      MacroAssembler *_masm;
      static const int REGS = 32;  // Increase as needed
      int regs[REGS];              // encodings of pushed registers, in push order
   public:
      PushPopTracker(MacroAssembler *_masm) : _counter(0), _masm(_masm) {}
      ~PushPopTracker() {
         assert(_counter == 0, "Push/pop pair mismatch");
      }

      void push(Register reg) {
         assert(_counter<REGS, "Push/pop overflow");
         regs[_counter++] = reg->encoding();
         if (VM_Version::supports_apx_f()) {
            __ pushp(reg);
         } else {
            __ push(reg);
         }
      }
      void pop(Register reg) {
         assert(_counter>0, "Push/pop underflow");
         // Decrement first so regs[_counter] is the most recently pushed register
         _counter--;
         assert(regs[_counter] == reg->encoding(), "Push/pop pair mismatch: %d != %d", regs[_counter], reg->encoding());
         if (VM_Version::supports_apx_f()) {
            __ popp(reg);
         } else {
            __ pop(reg);
         }
      }
};

address StubGenerator::generate_intpoly_montgomeryMult_P256() {
  __ align(CodeEntryAlignment);
  /*...*/
  address start = __ pc();
  __ enter();
  PushPopTracker s(_masm);
  s.push(r12); //1
  s.push(r13); //2
  s.push(r14); //3
  #ifdef _WIN64
  s.push(rsi); //4
  s.push(rdi); //5
  #endif
  s.push(rbp); //6
  __ movq(rbp, rsp);
  __ andq(rsp, -32);
  __ subptr(rsp, 32);
  // Register Map
  const Register aLimbs  = c_rarg0; // c_rarg0: rdi | rcx
  const Register bLimbs  = rsi;     // c_rarg1: rsi | rdx
  const Register rLimbs  = r8;      // c_rarg2: rdx | r8
  const Register tmp1    = r9;
  const Register tmp2    = r10;
  /*...*/
  __ movq(rsp, rbp);
  s.pop(rbp); //5
  #ifdef _WIN64
  s.pop(rdi); //4
  s.pop(rsi); //3
  #endif
  s.pop(r14); //2
  s.pop(r13); //1
  s.pop(r12); //0
  __ leave();
  __ ret(0);
  return start;
}

Contributor

@vamsi-parasa, It's better to make this as a subclass of MacroAssembler in src/hotspot/cpu/x86/macroAssembler_x86.hpp and pass Tracker as an argument to push / pop for a cleaner interface.

I don't think it's possible? Unless I am missing something..

  • Subclass has an instance of the base class (i.e. the memory allocation of PushPopTracker would have the MacroAssembler base class with extra fields appended); and MacroAssembler has already been allocated (i.e. you can't tack on more fields onto the end of the underlying memory of existing MacroAssembler..)
  • If it's a subclass, there is no reason to pass it as a parameter, because it already would have the parent's instance? Also, the extra parameter to push/pop (flag) was what I had originally objected to? (i.e. I would like push/pop to still just take one register as a parameter..)
  • This class is sort of a stripped-down implementation of reference counting; we want the stack-allocated variable (i.e. explicit constructor call) and the implicit destructor calls (i.e. inserted by g++ on all function exits). That is, we must have a stack allocated variable for it to be deallocated (and destructor called for assert check)

Here is an attempt to make it a subclass? And sample usage...

class PushPopTracker : public MacroAssembler {
   private:
      int _counter;
      static const int REGS = 32; // Increase as needed
      int regs[REGS];
   public:
   // MacroAssembler(CodeBuffer* code) is the only constructor?
      PushPopTracker() : MacroAssembler(???), _counter(0) {} //FIXME???
      ~PushPopTracker() {
         assert(_counter == 0, "Push/pop pair mismatch");
      }

      void push(Register reg) {
         assert(_counter<REGS, "Push/pop overflow");
         regs[_counter++] = reg->encoding();
         if (VM_Version::supports_apx_f()) {
            Assembler::pushp(reg);
         } else {
            Assembler::push(reg);
         }
      }
/*...*/
};

address StubGenerator::generate_intpoly_montgomeryMult_P256() {
  __ align(CodeEntryAlignment);
  /*...*/
  address start = __ pc();
  __ enter();
  PushPopTracker s(???); //FIXME?
  s.push(r12, /* Extra parm? */); //1

Contributor Author

@vamsi-parasa vamsi-parasa Jul 2, 2025

Hi Jatin (@jatin-bhateja) and Vlad (@vpaprotsk),

There's one more issue to consider. The C++ PushPopTracker code runs at stub generation time. There are code blocks that do a single push onto the stack but, because of multiple exit paths, have multiple pops, as illustrated below. Won't the reference counting approach fail in such a scenario, since the stub code is generated all at once during the stub generation phase?

#begin stack frame
push(r21)

#exit condition 1
pop(r21)

# exit condition 2
pop(r21)
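
To make the concern concrete, here is a minimal sketch, assuming a simplified counter-only tracker. `CountingTracker` and `generate_two_exit_stub` are hypothetical names used only for illustration, and the tracker records an error flag instead of asserting so the outcome can be inspected.

```cpp
#include <cassert>

// Hypothetical, simplified tracker: it only counts the pushes and pops that
// the generator emits, in emission order.
struct CountingTracker {
  int depth = 0;           // pushes minus pops emitted so far
  bool underflow = false;  // set when a pop is emitted with no outstanding push

  void push() { depth++; }
  void pop() {
    if (depth == 0) { underflow = true; return; }
    depth--;
  }
  bool balanced() const { return depth == 0 && !underflow; }
};

// Mirrors the shape above: one push, then one pop on each of two exit paths.
// At run time only one exit path executes, but the generator emits both.
CountingTracker generate_two_exit_stub() {
  CountingTracker t;
  t.push();  // begin stack frame: push(r21)
  t.pop();   // exit condition 1:  pop(r21)
  t.pop();   // exit condition 2:  pop(r21)
  return t;
}
```

The generated stub is correct on every run-time path, yet the tracker sees one push and two pops and reports an underflow, because it counts emitted instructions rather than executed ones.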

Contributor

Now that I had my fun writing an array-backed stack.. (and with David's comment too..) I can admit that the point of the entire C++ Tracker class is to 'just' add an assert; doesn't actually functionally add to the original code, but does add better JIT/stub compile-time checking.

@vamsi-parasa you are right.. if there are ifs and multiple exit paths in the assembler itself.. the Tracker won't be able to catch it (multiple exit paths in the generator are just fine though); I was thinking about this problem too last night... a hack/'solution' would be to disable such checking with a default flag in the constructor... 'fairly trivial' but it just adds to the complexity even more. And the assert was the point of the class to begin with... I do think such stubs are rare?

There is some value in improved checking, but enough? Writing stubs is already a 'you should know assembler very well' thing, so those checks only improve things marginally overall? As David says, it's for the compiler folks to decide :)
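
The constructor-flag workaround mentioned above could look roughly like the following. This is a sketch under assumptions, not a proposal for the actual HotSpot class: `OptionalCheckTracker`, its `check_balance` parameter, and `two_exit_stub_ok` are hypothetical, and a boolean result replaces the destructor assert so the behaviour can be observed.

```cpp
#include <cassert>

// Hypothetical tracker whose balance checking can be switched off for stubs
// that pop the same register on several exit paths.
struct OptionalCheckTracker {
  int depth = 0;           // pushes minus pops emitted so far
  bool check_balance;      // when false, imbalance is tolerated
  bool violation = false;  // a pop was emitted with no outstanding push

  explicit OptionalCheckTracker(bool check = true) : check_balance(check) {}

  void push() { depth++; }
  void pop() {
    if (check_balance && depth == 0) violation = true;
    depth--;
  }
  // What the destructor assert would test in a real tracker.
  bool ok() const { return !check_balance || (!violation && depth == 0); }
};

// One push followed by a pop on each of two exit paths.
bool two_exit_stub_ok(bool check) {
  OptionalCheckTracker t(check);
  t.push();
  t.pop();  // exit path 1
  t.pop();  // exit path 2
  return t.ok();
}
```

With checking on by default, the two-exit stub is flagged; passing `false` silences the check for that stub, which is exactly the trade-off noted above: the flag works, but it removes the assert that motivated the class.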

@jatin-bhateja
Member

/label add hotspot-compiler-dev

@openjdk

openjdk bot commented Jul 1, 2025

@jatin-bhateja
The hotspot-compiler label was successfully added.

@dholmes-ora
Member

@dholmes-ora would you mind sharing your opinion? We seem to be making things more complicated, but hopefully in a good way?

Seems very complicated to me. Really this is for compiler folk to discuss. And as noted above this "tracker" class only helps where the push/pop are paired in the same scope. Personally I think a "pushp" that is defined to be a "push-paired" when available, else a regular "push", would suffice in terms of API design. But again this is for compiler folk to determine.

4 participants