
Updating xarch to utilize EVEX compares and blending where profitable #116983


Open
wants to merge 4 commits into main

Conversation

tannergooding
Member

@tannergooding tannergooding commented Jun 24, 2025

This updates the xarch intrinsic logic to always import nodes as TYP_MASK where supported and to lower them back to the non-mask variants if no other optimizations were able to kick in. This allows better overall use of the hardware for existing intrinsic code paths.
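
For a sense of the instruction shape this enables, here is a minimal C++ intrinsics sketch (illustrative only, not the JIT's code; the function name and inputs are made up) of an EVEX compare into a mask register followed by a mask blend, the pattern the JIT can now prefer over a full-width vector compare plus vblendvps when profitable:

#include <immintrin.h> // requires AVX-512F + AVX-512VL (e.g. -mavx512f -mavx512vl)

// Hypothetical example: select elements of 'a' where x > y, else elements of 'b'.
__m256 select_greater(__m256 x, __m256 y, __m256 a, __m256 b)
{
    // vcmpps k, ymm, ymm, imm -- compare directly into a k (mask) register
    __mmask8 k = _mm256_cmp_ps_mask(x, y, _CMP_GT_OQ);

    // vblendmps ymm {k}, ymm, ymm -- blend under the mask; picks 'a' where k is set
    return _mm256_mask_blend_ps(k, b, a);
}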

@github-actions github-actions bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jun 24, 2025
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@tannergooding tannergooding added the NO-REVIEW Experimental/testing PR, do NOT review it label Jun 24, 2025
@tannergooding tannergooding removed the NO-REVIEW Experimental/testing PR, do NOT review it label Jul 1, 2025
@tannergooding tannergooding marked this pull request as ready for review July 1, 2025 19:47
@tannergooding
Member Author

CC @dotnet/jit-contrib. This should be ready for review. I could split this up into two PRs ("Allow rewriting of hwintrinsic mask ops to their non-mask forms" and "Default to using mask ops for V128/V256 on supporting hardware"), but I think it's worth doing these two together.

This is one of the last major milestones for the embedded masking support and helps ensure that all vector sizes are getting the expected implicit lightup.

@tannergooding tannergooding requested a review from EgorBo July 1, 2025 19:50
// Return Value:
// true if the instruction the node lowers to has EVEX embedded masking support
//
bool GenTree::isEmbeddedMaskingCompatible(Compiler* comp, unsigned tgtMaskSize, CorInfoType& tgtSimdBaseJitType) const
Member Author


This is just pulling the logic from lowering to here so it can be reused in the two places that need it.

Comment on lines +659 to +676
if (Lowering::IsInvariantInRange(op2, node, comp, scratchSideEffects))
{
unsigned tgtMaskSize = simdSize / genTypeSize(simdBaseType);
CorInfoType tgtSimdBaseJitType = CORINFO_TYPE_UNDEF;

if (op2->isEmbeddedMaskingCompatible(comp, tgtMaskSize, tgtSimdBaseJitType))
{
// We are going to utilize the embedded mask, so we don't need to rewrite. However,
// we want to fixup the simdBaseJitType here since it simplifies lowering and allows
// both embedded broadcast and the mask to be live simultaneously.

if (tgtSimdBaseJitType != CORINFO_TYPE_UNDEF)
{
op2->AsHWIntrinsic()->SetSimdBaseJitType(tgtSimdBaseJitType);
}
return;
}
}
Member Author


This was the cleanest/easiest way I could think of to do this.

The general idea is that we have several transforms we want to make to LIR before containment happens (because containment complicates these transforms). However, since we're in LIR form for these nodes at this point, we need to make sure the transform is safe to do.

An alternative would be to add a pre-pass for lowering and then do a separate post-pass for containment, but that feels like a bigger/bulkier/riskier change.

There are some other transforms that would be nice to move "here" longer term, like the folding of sequential insertps operations or the recognition of AND(x, NOT(y)) (sketched below); neither of these is something we want to do in HIR because it breaks or massively complicates other optimizations (like folding and operation negation) that we do.

I'd be happy to make this a separate pass for .NET 11, if we feel that is better. But for .NET 10 I think this is a less risky approach.
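
As a concrete illustration of the AND(x, NOT(y)) shape mentioned above, a minimal C++ intrinsics sketch (illustrative only, not the JIT transform itself):

#include <immintrin.h> // requires AVX2 (-mavx2)

// Computes x & ~y, which maps to a single vpandn. Note that the hardware
// instruction negates its first operand, so the operands are swapped here.
__m256i and_not(__m256i x, __m256i y)
{
    return _mm256_andnot_si256(y, x);
}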

@tannergooding
Member Author

The size regressions that are showing up are primarily from cases where we decide to fall back to the non-kmask variant and the comparison is against zero. Such cases now have to emit an extra vxorps since it isn't CSE'd.

In other words, the few regressions are mainly due to #70182. We might be able to mitigate some of that by finding an existing CSE with a good VN in range or by doing other backtracking search tricks, like we've done as workarounds for that issue; but I don't think we should block this PR on that, particularly since many important cases are improved and the vxorps is elided by the register renamer since it's just producing 0.
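
For context on the vxorps point, here is a minimal C++ intrinsics sketch (illustrative only, not the affected code) of the non-kmask compare-against-zero shape: the zero operand has to be materialized, typically as vxorps, which is a zeroing idiom the register renamer resolves without an execution uop, so the extra instruction mostly costs code size:

#include <immintrin.h> // requires AVX (-mavx)

// Legacy (non-kmask) compare against zero: the zero vector is materialized explicitly.
__m256 less_than_zero(__m256 x)
{
    __m256 zero = _mm256_setzero_ps();          // vxorps ymm, ymm, ymm (zeroing idiom)
    return _mm256_cmp_ps(x, zero, _CMP_LT_OQ);  // VEX vcmpps producing a vector mask
}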
