-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Consider removing x86_64 field assembly #726
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Yeah I saw weird similiar things when looked into enabling asm in |
It would be really interesting to see this with old compilers, and figure out if people need to rely on old compilers. An alternative could be to disable ASM by default but still support it. |
I'm not sure the field asm ever got updated to use the improved mul/sqr approach that the C code uses (especially field_5x52_int128_impl.h). |
MSVC supports neither __int128 or inline GCC-style assembly. I believe all MSVC builds of this code just use 32-bit arithmetic. They have a builtin for 128-bit wide multiplication though, so we could write an analogue of the int128 code using that instead. |
Another question is (assuming for a moment that performance is identical) whether we prefer the asm code or the C code. There are indeed some reasons to prefer asm, e.g., the compiler can't miscompile it. |
I read through the asm now, it looks like it matches the C code. I think this was done here f048615 FWIW looking in godbolt, the generated asm of the C functions look very much similar to the ones implemented in asm, although looking at llvm-mca, it says that the C code is twice as much cycles, which sounds weird. ASM: https://godbolt.org/z/YxGUgE |
I've done benchmarks on gcc and clang, with various versions, All tests done on a Intel(R) Core(TM) i7-7820HQ CPU, with clock fixed at 2.3 GHz. Raw data:
|
So eh... should we look at the GCC 9.3.0+ -O3 output, and try convert that to inline assembly? |
Oof. Okay, so the clang performance is not really dependent on the version, so it's more interesting to look at GCC. I think then that we should keep the asm for now, given that there many people/distros don't have GCC >= 9. It would be interesting to see what causes the difference in -O2 for GCC 9 vs 10. We switched to -O2 because there were no large differences, and we figured out that -O2 is in fact quicker on GCC (see #700 (comment) for GCC 8). For example, if they moved some flags from -O2 to -O3 in GCC 10, maybe we want to enable them again?
PR welcome. 😛 By the way, the green line is missing from the clang graph. |
Nice data @sipa. So our current default options are the best choices for <= gcc-8. Furthermore, now that we've switched to -O2 by default, with gcc-10 scalar assembly alone is not faster than scalar and field assembly together. As in the OP (gcc-9), with gcc-10's -O3, scalar assembly alone is the fastest configuration.
No idea what changed in gcc-10, but looking at above slopes there is some reason to believe that |
No, it's actually just hidden by the orange line. For all versions of clang, -O2/asm seems indistinguishable from -O3/asm. |
FWIW, when compiling to asm $ sed -i "11s/^/#include \"..\/include\/secp256k1.h\" \n#include \"util.h\"\n /" src/field_5x52_int128_impl.h
$ sed -i 's/static//g' src/field_5x52_int128_impl.h
$ gcc -xc -std=c89 -S -O2 -DHAVE___INT128 src/field_5x52_int128_impl.h -o o2.s
$ gcc -xc -std=c89 -S -O3 -DHAVE___INT128 src/field_5x52_int128_impl.h -o o3.s
$ diff o2.s o3.s If we do actually believe that gcc-10 produce better asm we can easily use the commands above and then just link to these functions no matter the gcc version |
I think by now, we should just turn off asm by default. Compiler-generated code has been faster for a few years, and so it should be better for most users. (related: bitcoin/bitcoin#27897) Turning it off by default is real progress, and then we can keep procrastinating on whether we want to remove the existing ASM, replace it by CryptOpt, or whatever. Recent (unscientific) benchmark on my laptop:
|
@theStack Are you willing to run some detailed benchmarks here with recent compilers? |
cc @jamesob (might also be interested in benching) |
We've just discussed this on IRC (log. A short summary is:
In any case, the next step here is benchmark GCC 10.5.0 vs. our field ASM, both on the macro (ECDSA/Schnorr verify) and micro-level (field benchmarks). |
I've ran some benchmarks with GCC 10.5.0 on master (commit e721039) and see the verification routines for both ECDSA and Schnorr signatures are significantly faster without the ASM optimizations (>10%). As assumed in the recent IRC discussion though, the scalar routines still perform better with ASM:
So it seems, given that others can confirm these results, that at least the field ASM could/should be removed (or replaced by compiler-generated ASM)? Also happy to run more benchmarks with newer version of gcc and clang. |
Thanks, just a quick remark: Can you also check the |
Sure, good point. I've re-ran my benchmarks (this time with the "sign" parameter added to the |
Widely available versions of GCC and Clang beat our field asm on -O2. In particular, GCC 10.5.0, which is Bitcoin Core's current compiler for official x86_64 builds, produces code that is > 20% faster for fe_mul and > 10% faster for signature verification (see bitcoin-core#726). These are the alternatives to this PR: We could replace our current asm with the fastest compiler output that we can find. This is potentially faster, but it has multiple drawbacks: - It's more coding work because it needs detailed benchmarks (e.g., with many compiler/options). - It's more review work because we need to deal with inline asm (including clobbers etc.) and there's a lack of experts reviewers in this area. - It's not unlikely that we'll fall behind again in a few compiler versions, and then we have to deal with this again, i.e., redo the benchmarks. Given our history here, I doubt that we'll revolve this timely. We could change the default of the asm build option to off. But this will also disable the scalar asm, which is still faster. We could split the build option into two separate options for field and scalar asm and only disable the field asm by default. But this adds complexity to the build and to the test matrix. My conclusion is that this PR gets the low-hanging fruit in terms of performance. It simplifies our code significantly. It's clearly an improvement, and it's very easy to review. Whether re-introducing better asm (whether from a compiler or from CryptOpt) is worth the hassle can be evaluated separately, and should not hold up this improvement. Solves bitcoin-core#726.
nit: I think you wanted to run the benchmark for |
1ddd76a bench: add --help option to bench_internal (Sebastian Falbesoner) Pull request description: While coming up with commands for running the benchmarks for issue #726 (comment), I noticed that in contrast to `bench{_ecmult}`, `bench_internal` doesn't have a help option yet and figured it would be nice to have one. A comparable past PR is #1008. Benchmark categories appear in the same order as they are executed, the concrete benchmark names in parantheses per category are listed in alphabetical order. ACKs for top commit: real-or-random: utACK 1ddd76a siv2r: ACK 1ddd76a, tested the `--help` option locally, and it works as expected. Tree-SHA512: d117641a5f25a7cbf83881f3acceae99624528a0cbb2405efdbe1a3a2762b4d6b251392e954aaa32f6771069d31143743770fccafe198084c12258dedb0856fc
Ah, that was indeed unintended. Fixed that for the two other benchmarks I ran at the course of reviewing #1446 (#1446 (review)). |
We have removed the field assembly in #1446 |
I noticed that turning off x86_64 assembly on my laptop actually speeds up
ecdsa_verify
. The internal benchmarks show that--without-asm
scalar operations are slower, but field operations are faster. In order to investigate this I created a branch that includes the configurable benchmark iterations from #722, a test_matrix.sh script and allows turning on field assembly individually (https://github.com/jonasnick/secp256k1/tree/eval-asm).Here are the results with gcc 9.3.0 (got similar results with clang 9.0.1):
Note the 6.5% ecdsa_verify speedup. However, I don't fully understand this:
It should be noted that without field arithmetic assembly, in order to use 64 bit field arithmetic you need to have
__int128
support (or use field=32bit with a 40% verification slowdown). I did not check where this is supported (MSVC?). Also we should try this with older compilers.The text was updated successfully, but these errors were encountered: