8358032: Use crypto pmull for CRC32(C) on Ampere CPU and improve for short inputs #25609
@@ -4332,7 +4332,7 @@ void MacroAssembler::kernel_crc32_using_crypto_pmull(Register crc, Register buf,
     Label CRC_by4_loop, CRC_by1_loop, CRC_less128, CRC_by128_pre, CRC_by32_loop, CRC_less32, L_exit;
     assert_different_registers(crc, buf, len, tmp0, tmp1, tmp2);

-    subs(tmp0, len, 384);
+    subs(tmp0, len, CryptoPmullForCRC32LowLimit);
     mvnw(crc, crc);
     br(Assembler::GE, CRC_by128_pre);
   BIND(CRC_less128);

Reviewer: Would it make sense to have another alignment sanity check here? It would be both helpful to make sure nobody later breaks your assumption, and could also be helpful for the reader to see the …

Author: I think the alignment does not affect the correctness here, but the value should be >= 256, so I added the corresponding assertion above.
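The assertion the author mentions is outside the quoted hunk. A minimal sketch of what such a sanity check could look like near the top of the stub (the exact wording and placement are assumptions; only the flag name and the >= 256 requirement come from the thread):

    // Sketch only, not part of the shown diff: make the stub's assumption
    // explicit so a later change to the flag cannot silently break it.
    assert(CryptoPmullForCRC32LowLimit >= 256,
           "CryptoPmullForCRC32LowLimit must be at least 256");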
@@ -4346,13 +4346,13 @@ void MacroAssembler::kernel_crc32_using_crypto_pmull(Register crc, Register buf,
     b(L_exit);

   BIND(CRC_by32_loop);
-    ldp(tmp0, tmp1, Address(buf));
+    ldp(tmp0, tmp1, Address(post(buf, 16)));
-    subs(len, len, 32);
     crc32x(crc, crc, tmp0);
-    ldp(tmp2, tmp3, Address(buf, 16));
+    ldr(tmp2, Address(post(buf, 8)));
     crc32x(crc, crc, tmp1);
-    add(buf, buf, 32);
+    ldr(tmp3, Address(post(buf, 8)));
     crc32x(crc, crc, tmp2);
+    subs(len, len, 32);
     crc32x(crc, crc, tmp3);
     br(Assembler::GE, CRC_by32_loop);
     cmn(len, (u1)32);

Reviewer: What is the point of these changes?

Reviewer: To be more precise: converting these adjustments to post-increment operations isn't obviously an improvement on AArch64 generally. How does it help?

Author: According to perf, the post-increment ops help to reduce accesses to the TLB on Ampere1 in this case.

Reviewer: Hmm, but it's code in a rather odd style in shared code. And from what I see, the intrinsic is only 22% of the runtime (for 128 bytes) anyway, and you're making the code larger. I certainly don't want to see this sort of thing proliferating in the intrinsics. In general, it's up to CPU designers to make simple, straightforward code work well. How important is this?

Reviewer: On the other hand, this code already exists in CRC32C, so it's simply unifying the two routines. OK, I won't object.

Author: I don't think this makes the code larger. As I mentioned in problem 1, this causes a regression (~-14%) on Ampere1 when handling 64 bytes; there are no obvious effects in the other cases. Yes.
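For readers less familiar with the MacroAssembler addressing helpers: Address(post(buf, n)) requests AArch64 post-indexed addressing, so the pointer bump is folded into the load and the separate add(buf, buf, 32) disappears. A rough illustration of the two styles (the register numbers in the comments are placeholders; the encodings come from the AArch64 ISA, not from this PR):

    // Base-plus-offset loads with an explicit pointer bump:
    ldp(tmp0, tmp1, Address(buf));            // ldp x10, x11, [x2]
    ldp(tmp2, tmp3, Address(buf, 16));        // ldp x12, x13, [x2, #16]
    add(buf, buf, 32);                        // add x2, x2, #32

    // Post-indexed loads: the base register is advanced as part of each load,
    // so no separate add is needed.
    ldp(tmp0, tmp1, Address(post(buf, 16)));  // ldp x10, x11, [x2], #16
    ldr(tmp2, Address(post(buf, 8)));         // ldr x12, [x2], #8
    ldr(tmp3, Address(post(buf, 8)));         // ldr x13, [x2], #8

Both sequences leave buf advanced by 32 bytes; the discussion above is about whether the post-indexed form is worth the stylistic cost.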
@@ -4697,7 +4697,7 @@ void MacroAssembler::kernel_crc32c_using_crypto_pmull(Register crc, Register buf,
     Label CRC_by4_loop, CRC_by1_loop, CRC_less128, CRC_by128_pre, CRC_by32_loop, CRC_less32, L_exit;
     assert_different_registers(crc, buf, len, tmp0, tmp1, tmp2);

-    subs(tmp0, len, 384);
+    subs(tmp0, len, CryptoPmullForCRC32LowLimit);
     br(Assembler::GE, CRC_by128_pre);
   BIND(CRC_less128);
     subs(len, len, 32);
@@ -4710,14 +4710,13 @@ void MacroAssembler::kernel_crc32c_using_crypto_pmull(Register crc, Register buf,
     b(L_exit);

   BIND(CRC_by32_loop);
-    ldp(tmp0, tmp1, Address(buf));
+    ldp(tmp0, tmp1, Address(post(buf, 16)));
-    subs(len, len, 32);
     crc32cx(crc, crc, tmp0);
-    ldr(tmp2, Address(buf, 16));
+    ldr(tmp2, Address(post(buf, 8)));
     crc32cx(crc, crc, tmp1);
-    ldr(tmp3, Address(buf, 24));
+    ldr(tmp3, Address(post(buf, 8)));
     crc32cx(crc, crc, tmp2);
-    add(buf, buf, 32);
+    subs(len, len, 32);
     crc32cx(crc, crc, tmp3);
     br(Assembler::GE, CRC_by32_loop);
     cmn(len, (u1)32);
@@ -120,6 +120,11 @@ void VM_Version::initialize() {
     ContendedPaddingWidth = dcache_line;
   }

+  if (!(is_aligned(CryptoPmullForCRC32LowLimit, 128))) {
+    warning("CryptoPmullForCRC32LowLimit must be a multiple of 128");
+    CryptoPmullForCRC32LowLimit = align_down(CryptoPmullForCRC32LowLimit, 128);
+  }
+
   if (os::supports_map_sync()) {
     // if dcpop is available publish data cache line flush size via
     // generic field, otherwise let if default to zero thereby

Reviewer (on lines +123 to +126): Can you describe somewhere why it has to be a multiple of 128?

Author: There are 4 kinds of loops, labeled CRC_by128_loop, CRC_by32_loop, CRC_by4_loop and CRC_by1_loop. If the flag is 266, which is 128x2+10, then for 265 bytes of input there are 256 bytes handled by CRC_by32_loop, while for 266 bytes of input the corresponding 256 bytes are handled by CRC_by128_loop, and I think this causes an inconsistency. If CRC_by32_loop handles 256 bytes better than CRC_by128_loop on a platform, it should be used for 266 bytes as well.
@@ -148,6 +153,9 @@ void VM_Version::initialize() {
   if (_cpu == CPU_AMPERE && ((_model == CPU_MODEL_AMPERE_1) ||
                              (_model == CPU_MODEL_AMPERE_1A) ||
                              (_model == CPU_MODEL_AMPERE_1B))) {
+    if (FLAG_IS_DEFAULT(UseCryptoPmullForCRC32)) {
+      FLAG_SET_DEFAULT(UseCryptoPmullForCRC32, true);
+    }
     if (FLAG_IS_DEFAULT(UseSIMDForMemoryOps)) {
       FLAG_SET_DEFAULT(UseSIMDForMemoryOps, true);
     }
@@ -265,6 +273,9 @@ void VM_Version::initialize() {
     if (FLAG_IS_DEFAULT(UseCryptoPmullForCRC32)) {
       FLAG_SET_DEFAULT(UseCryptoPmullForCRC32, true);
     }
+    if (FLAG_IS_DEFAULT(CryptoPmullForCRC32LowLimit)) {
+      FLAG_SET_DEFAULT(CryptoPmullForCRC32LowLimit, 384);
+    }
     if (FLAG_IS_DEFAULT(CodeEntryAlignment)) {
       FLAG_SET_DEFAULT(CodeEntryAlignment, 32);
     }

Reviewer: Is it sane to have negative values? If not, use uintx ... or maybe even just uint?
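The declaration of the flag is not in the quoted hunks. A hypothetical sketch of what an unsigned declaration with a range could look like in the AArch64 globals file (the description string and the range bounds are placeholders, and inside the real ARCH_FLAGS macro each line would also carry a trailing backslash continuation):

    // Hypothetical sketch only, not the PR's actual declaration: an unsigned
    // flag type plus a range() rules out negative or nonsensical values.
    product(uint, CryptoPmullForCRC32LowLimit, 384,
            "Minimum input length (in bytes) for which the crypto pmull "
            "based CRC32/CRC32C code is used")
            range(256, 1024*1024)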