8358032: Use crypto pmull for CRC32(C) on Ampere CPU and improve for short inputs #25609

Open: wants to merge 5 commits into master
Changes from 3 commits
4 changes: 4 additions & 0 deletions src/hotspot/cpu/aarch64/globals_aarch64.hpp
@@ -89,6 +89,10 @@ define_pd_global(intx, InlineSmallCode, 1000);
"Use CRC32 instructions for CRC32 computation") \
product(bool, UseCryptoPmullForCRC32, false, \
"Use Crypto PMULL instructions for CRC32 computation") \
product(intx, CryptoPmullForCRC32LowLimit, 256, DIAGNOSTIC, \
"Minimum size in bytes when Crypto PMULL will be used." \
"Value must be a multiple of 128.") \
range(256, max_jint) \
Contributor:
Is it sane to have negative values? If not, use uintx... or maybe even just uint?
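
A minimal sketch of that suggestion, assuming the flag keeps the same product()/range() macro shape used elsewhere in globals_aarch64.hpp; the choice of uint here (rather than uintx) is hypothetical:

```cpp
// Hypothetical unsigned variant of the new flag: same default and range
// as the proposed intx version, but negative values are ruled out by type.
product(uint, CryptoPmullForCRC32LowLimit, 256, DIAGNOSTIC,           \
        "Minimum size in bytes when Crypto PMULL will be used. "      \
        "Value must be a multiple of 128.")                           \
        range(256, max_jint)                                          \
```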

product(bool, UseSIMDForMemoryOps, false, \
"Use SIMD instructions in generated memory move code") \
product(bool, UseSIMDForArrayEquals, true, \
21 changes: 10 additions & 11 deletions src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp
@@ -4332,7 +4332,7 @@ void MacroAssembler::kernel_crc32_using_crypto_pmull(Register crc, Register buf,
Label CRC_by4_loop, CRC_by1_loop, CRC_less128, CRC_by128_pre, CRC_by32_loop, CRC_less32, L_exit;
assert_different_registers(crc, buf, len, tmp0, tmp1, tmp2);

subs(tmp0, len, 384);
subs(tmp0, len, CryptoPmullForCRC32LowLimit);
Contributor:
Would it make sense to have another alignment sanity check here? It would both make sure nobody later breaks your assumption and help the reader see the 128-byte alignment immediately.

Member Author:
I think the alignment does not affect the correctness here, but it should be >= 256. So I added the corresponding assertion above.
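
For illustration, the guard being described might look roughly like this (a sketch only; the exact wording and placement of the assertion in the patch may differ):

```cpp
// Sanity check that the configured limit is at least 256 bytes, so the
// PMULL path is never entered with less data than it assumes.
assert(CryptoPmullForCRC32LowLimit >= 256,
       "CryptoPmullForCRC32LowLimit must be at least 256");
```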

mvnw(crc, crc);
br(Assembler::GE, CRC_by128_pre);
BIND(CRC_less128);
@@ -4346,13 +4346,13 @@ void MacroAssembler::kernel_crc32_using_crypto_pmull(Register crc, Register buf,
b(L_exit);

BIND(CRC_by32_loop);
ldp(tmp0, tmp1, Address(buf));
ldp(tmp0, tmp1, Address(post(buf, 16)));
subs(len, len, 32);
crc32x(crc, crc, tmp0);
ldp(tmp2, tmp3, Address(buf, 16));
ldr(tmp2, Address(post(buf, 8)));
crc32x(crc, crc, tmp1);
add(buf, buf, 32);
ldr(tmp3, Address(post(buf, 8)));
crc32x(crc, crc, tmp2);
subs(len, len, 32);
Contributor:
What is the point of these changes?

Contributor:
To be more precise: converting these adjustments to post-increment operations isn't obviously an improvement on AArch64 generally. How does it help?

Member Author:
According to perf, post-increment ops help to reduce the access to TLB on Ampere1 in this case.
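
For readers comparing the two styles, a rough sketch of the difference using the MacroAssembler calls from this file (the AArch64 instructions in the comments use illustrative register names, not the actual temp-register bindings):

```cpp
// Old style: plain base/offset addressing plus one explicit pointer bump.
ldp(tmp0, tmp1, Address(buf));            // ldp  x0, x1, [buf]
ldp(tmp2, tmp3, Address(buf, 16));        // ldp  x2, x3, [buf, #16]
add(buf, buf, 32);                        // add  buf, buf, #32

// New style: each load advances buf as a side effect (post-increment
// addressing), so no separate add is needed for the 32-byte step.
ldp(tmp0, tmp1, Address(post(buf, 16)));  // ldp  x0, x1, [buf], #16
ldr(tmp2, Address(post(buf, 8)));         // ldr  x2, [buf], #8
ldr(tmp3, Address(post(buf, 8)));         // ldr  x3, [buf], #8
```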

Contributor:
> According to perf, post-increment ops help to reduce the access to TLB on Ampere1 in this case.

Hmm, but it's code in a rather odd style in shared code. And from what I see, the intrinsic is only 22% of the runtime (for 128 bytes) anyway, and you're making the code larger. I certainly don't want to see this sort of thing proliferating in the intrinsics.

In general, it's up to CPU designers to make simple, straightforward code work well.

How important is this?

Contributor:
On the other hand this code already exists in CRC32C, so it's simply unifying the two routines. OK, I won't object.

Member Author:
> you're making the code larger.

I don't think this makes the code larger.

> How important is this?

As I mentioned in problem 1, this causes a regression (~-14%) on Ampere1 when handling 64 bytes. No obvious effects in other cases though.

> so it's simply unifying the two routines.

Yes.

crc32x(crc, crc, tmp3);
br(Assembler::GE, CRC_by32_loop);
cmn(len, (u1)32);
@@ -4697,7 +4697,7 @@ void MacroAssembler::kernel_crc32c_using_crypto_pmull(Register crc, Register buf
Label CRC_by4_loop, CRC_by1_loop, CRC_less128, CRC_by128_pre, CRC_by32_loop, CRC_less32, L_exit;
assert_different_registers(crc, buf, len, tmp0, tmp1, tmp2);

subs(tmp0, len, 384);
subs(tmp0, len, CryptoPmullForCRC32LowLimit);
br(Assembler::GE, CRC_by128_pre);
BIND(CRC_less128);
subs(len, len, 32);
@@ -4710,14 +4710,13 @@ void MacroAssembler::kernel_crc32c_using_crypto_pmull(Register crc, Register buf
b(L_exit);

BIND(CRC_by32_loop);
ldp(tmp0, tmp1, Address(buf));
ldp(tmp0, tmp1, Address(post(buf, 16)));
subs(len, len, 32);
crc32cx(crc, crc, tmp0);
ldr(tmp2, Address(buf, 16));
ldr(tmp2, Address(post(buf, 8)));
crc32cx(crc, crc, tmp1);
ldr(tmp3, Address(buf, 24));
ldr(tmp3, Address(post(buf, 8)));
crc32cx(crc, crc, tmp2);
add(buf, buf, 32);
subs(len, len, 32);
crc32cx(crc, crc, tmp3);
br(Assembler::GE, CRC_by32_loop);
cmn(len, (u1)32);
11 changes: 11 additions & 0 deletions src/hotspot/cpu/aarch64/vm_version_aarch64.cpp
@@ -120,6 +120,11 @@ void VM_Version::initialize() {
ContendedPaddingWidth = dcache_line;
}

if (!(is_aligned(CryptoPmullForCRC32LowLimit, 128))) {
warning("CryptoPmullForCRC32LowLimit must be a multiple of 128");
CryptoPmullForCRC32LowLimit = align_down(CryptoPmullForCRC32LowLimit, 128);
}
Comment on lines +123 to +126
Contributor:
Can you describe somewhere why it has to be a multiple of 128? Imagine someone comes across this later, and wonders if that is just some strange implementation limitation or something more fundamental, or something very subtle.

Member Author:
There are 4 kinds of loops, labeled CRC_by128_loop, CRC_by32_loop, CRC_by4_loop and CRC_by1_loop. If the flag is 266 (which is 128x2+10), then for 265 bytes of input the first 256 bytes are handled by CRC_by32_loop, while for 266 bytes of input the corresponding 256 bytes are handled by CRC_by128_loop, and I think this causes an inconsistency. If CRC_by32_loop handles 256 bytes better than CRC_by128_loop on a platform, it should be used for 266 bytes as well.
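
A standalone sketch of that boundary effect (the limit value 266 and the byte accounting are illustrative, not code from the patch):

```cpp
#include <cstdio>
#include <initializer_list>

int main() {
  const int limit = 266;  // hypothetical setting that is not a multiple of 128
  for (int len : {265, 266}) {
    // len >= limit takes the CRC_by128 path, otherwise the CRC_by32 path.
    bool by128 = (len >= limit);
    int bulk = by128 ? (len / 128) * 128   // bytes covered by CRC_by128_loop
                     : (len / 32) * 32;    // bytes covered by CRC_by32_loop
    std::printf("len=%d -> %s handles %d bytes, smaller loops do the rest\n",
                len, by128 ? "CRC_by128_loop" : "CRC_by32_loop", bulk);
  }
  return 0;
}
```

Both inputs end up with the same 256 bytes of bulk work, but one extra input byte flips which loop processes them, which is the inconsistency described above.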


if (os::supports_map_sync()) {
// if dcpop is available publish data cache line flush size via
// generic field, otherwise let if default to zero thereby
@@ -148,6 +153,9 @@ void VM_Version::initialize() {
if (_cpu == CPU_AMPERE && ((_model == CPU_MODEL_AMPERE_1) ||
(_model == CPU_MODEL_AMPERE_1A) ||
(_model == CPU_MODEL_AMPERE_1B))) {
if (FLAG_IS_DEFAULT(UseCryptoPmullForCRC32)) {
FLAG_SET_DEFAULT(UseCryptoPmullForCRC32, true);
}
if (FLAG_IS_DEFAULT(UseSIMDForMemoryOps)) {
FLAG_SET_DEFAULT(UseSIMDForMemoryOps, true);
}
@@ -265,6 +273,9 @@ void VM_Version::initialize() {
if (FLAG_IS_DEFAULT(UseCryptoPmullForCRC32)) {
FLAG_SET_DEFAULT(UseCryptoPmullForCRC32, true);
}
if (FLAG_IS_DEFAULT(CryptoPmullForCRC32LowLimit)) {
FLAG_SET_DEFAULT(CryptoPmullForCRC32LowLimit, 384);
}
if (FLAG_IS_DEFAULT(CodeEntryAlignment)) {
FLAG_SET_DEFAULT(CodeEntryAlignment, 32);
}