Consider removing x86_64 field assembly #726

Closed
jonasnick opened this issue Mar 24, 2020 · 22 comments

@jonasnick
Contributor

I noticed that turning off x86_64 assembly on my laptop actually speeds up ecdsa_verify. The internal benchmarks show that with --without-asm the scalar operations are slower, but the field operations are faster. To investigate this, I created a branch that includes the configurable benchmark iterations from #722, adds a test_matrix.sh script, and allows turning on the field assembly individually (https://github.com/jonasnick/secp256k1/tree/eval-asm).
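
(For context, the per-field toggle roughly amounts to a preprocessor switch like the sketch below; USE_ASM_X86_64_FIELD is the define passed via CFLAGS in the configurations below, and the gating shown is an illustration based on the existing 5x52 field headers, not necessarily the branch's exact code.)

    /* Sketch: choose the field mul/sqr implementation at build time.
       USE_ASM_X86_64_FIELD is the define passed via CFLAGS; the header
       names follow the existing 5x52 field code. */
    #if defined(USE_ASM_X86_64_FIELD)
    #include "field_5x52_asm_impl.h"    /* hand-written x86_64 assembly */
    #else
    #include "field_5x52_int128_impl.h" /* portable C using __int128 */
    #endif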

Here are the results with gcc 9.3.0 (got similar results with clang 9.0.1):

SECP256K1_BENCH_ITERS=200000

bench config CFLAGS=-DUSE_ASM_X86_64_FIELD ./configure --disable-openssl-tests --with-asm=x86_64
scalar_sqr: min 0.0331us / avg 0.0332us / max 0.0337us
scalar_mul: min 0.0342us / avg 0.0343us / max 0.0345us
field_sqr: min 0.0165us / avg 0.0165us / max 0.0167us
field_mul: min 0.0204us / avg 0.0205us / max 0.0209us
ecdsa_sign: min 40.3us / avg 40.3us / max 40.4us
ecdsa_verify: min 56.9us / avg 56.9us / max 56.9us

bench config CFLAGS= ./configure --disable-openssl-tests --without-asm
scalar_sqr: min 0.0375us / avg 0.0376us / max 0.0383us
scalar_mul: min 0.0362us / avg 0.0366us / max 0.0396us
field_sqr: min 0.0152us / avg 0.0152us / max 0.0152us
field_mul: min 0.0177us / avg 0.0178us / max 0.0178us
ecdsa_sign: min 41.8us / avg 41.8us / max 41.9us
ecdsa_verify: min 54.6us / avg 54.7us / max 54.7us

bench config CFLAGS= ./configure --disable-openssl-tests --with-asm=x86_64
scalar_sqr: min 0.0331us / avg 0.0331us / max 0.0333us
scalar_mul: min 0.0342us / avg 0.0343us / max 0.0347us
field_sqr: min 0.0152us / avg 0.0153us / max 0.0154us
field_mul: min 0.0178us / avg 0.0178us / max 0.0180us
ecdsa_sign: min 40.3us / avg 40.3us / max 40.4us
ecdsa_verify: min 53.2us / avg 53.2us / max 53.2us

Note the 6.5% ecdsa_verify speedup. However, I don't fully understand this:

  1. There's assembly for field_sqr and field_mul. If we remove it, both functions are faster. But, some other internal functions are slower. For example:
    SECP256K1_BENCH_ITERS=200000
    group_add_affine: min 0.257us / avg 0.257us / max 0.259us
    vs.
    group_add_affine: min 0.263us / avg 0.263us / max 0.264us
    
    This could just be an artifact of micro-benchmarking, and I have not tested this with #667 (Is the compiler optimizing out some of the benchmarks?).
  2. Removing the field assembly also makes ecdsa verification slower if endomorphism is enabled.
    SECP256K1_BENCH_ITERS=200000
    ecdsa_verify: min 41.1us / avg 41.1us / max 41.1us
    vs.
    ecdsa_verify: min 41.5us / avg 41.6us / max 41.6us
    

It should be noted that without the field assembly, using 64-bit field arithmetic requires __int128 support (otherwise you fall back to field=32bit, with a 40% verification slowdown). I did not check where __int128 is supported (MSVC?). We should also try this with older compilers.
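
(For context on why __int128 matters here: the 5x52 field representation multiplies 52-bit limbs and accumulates the up-to-104-bit products in a 128-bit accumulator. A minimal sketch of that pattern, not the actual field_5x52_int128_impl.h code:)

    #include <stdint.h>

    typedef unsigned __int128 uint128_t;

    /* Sketch only: accumulate products of 52-bit limbs into a 128-bit
       accumulator and extract one 52-bit limb plus a carry, roughly as
       the C field code does. Without __int128 (or an equivalent), only
       the 32-bit field representation remains available. */
    static void mul_acc_sketch(uint64_t a0, uint64_t a1, uint64_t b0, uint64_t b1,
                               uint64_t *limb_out, uint128_t *carry_out) {
        uint128_t acc = (uint128_t)a0 * b1 + (uint128_t)a1 * b0; /* up to ~105 bits */
        *limb_out  = (uint64_t)acc & 0xFFFFFFFFFFFFFULL;         /* low 52 bits */
        *carry_out = acc >> 52;                                  /* carried to next limb */
    }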

@elichai
Contributor

elichai commented Mar 24, 2020

Yeah, I saw similarly weird things when I looked into enabling asm in rust-secp256k1, but gave up trying to investigate it at some point.

@real-or-random
Contributor

It would be really interesting to see this with old compilers, and figure out if people need to rely on old compilers. An alternative could be to disable ASM by default but still support it.

@peterdettman
Copy link
Contributor

I'm not sure the field asm ever got updated to use the improved mul/sqr approach that the C code uses (especially field_5x52_int128_impl.h).

@sipa
Contributor

sipa commented Mar 24, 2020

MSVC supports neither __int128 nor inline GCC-style assembly. I believe all MSVC builds of this code just use 32-bit arithmetic. They have a builtin for 128-bit wide multiplication though, so we could write an analogue of the int128 code using that instead.
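
(A rough sketch of what such an analogue could look like, using MSVC's _umul128 plus _addcarry_u64 to emulate the 64x64->128 multiply-and-accumulate that the __int128 code relies on; this is an illustration, not code from the library.)

    #include <stdint.h>
    #include <intrin.h> /* _umul128, _addcarry_u64 (MSVC, x64) */

    /* Sketch: emulate "acc += (unsigned __int128)a * b" with a (lo, hi) pair. */
    static void muladd128_sketch(uint64_t a, uint64_t b, uint64_t *acc_lo, uint64_t *acc_hi) {
        uint64_t hi, lo;
        unsigned char c;
        lo = _umul128(a, b, &hi);                    /* 64x64 -> 128-bit product */
        c = _addcarry_u64(0, *acc_lo, lo, acc_lo);   /* add low half, capture carry */
        (void)_addcarry_u64(c, *acc_hi, hi, acc_hi); /* propagate carry into high half */
    }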

@real-or-random
Contributor

Another question is (assuming for a moment that performance is identical) whether we prefer the asm code or the C code. There are indeed some reasons to prefer asm, e.g., the compiler can't miscompile it.

@elichai
Contributor

elichai commented Mar 29, 2020

I'm not sure the field asm ever got updated to use the improved mul/sqr approach that the C code uses (especially field_5x52_int128_impl.h).

I read through the asm now, and it looks like it matches the C code. I think this was done in f048615.

FWIW, looking at godbolt, the generated asm of the C functions looks very similar to the hand-written asm, although llvm-mca says the C code takes about twice as many cycles, which sounds weird.

ASM: https://godbolt.org/z/YxGUgE
C: https://godbolt.org/z/JPzYbo

@sipa
Contributor

sipa commented May 28, 2020

I've done benchmarks on gcc and clang, with various versions, at -O2 and -O3, with and without assembly enabled. All benchmarks use --disable-endomorphism --with-bignum=no. The results are kind of interesting, and show that they're clearly version-dependent, unfortunately.

All tests were done on an Intel(R) Core(TM) i7-7820HQ CPU, with the clock fixed at 2.3 GHz.

[Chart: secp256k1-ecdsa-speed-gcc (ecdsa_verify time by GCC version)]

[Chart: secp256k1-ecdsa-speed-clang (ecdsa_verify time by clang version)]

Raw data:

$ (for C in gcc-4.7 gcc-4.9 gcc-5 gcc-6 gcc-7 gcc-8 gcc-9 gcc-10 clang-3.7 clang-3.8 clang-4.0 clang-6.0 clang-7 clang-8 clang-9 clang-10; do for OPT in "-O2" "-O3"; do for ASM in no x86_64; do (git clean -dfx && ./autogen.sh 2>/dev/null && ./configure CC=$C CFLAGS="$OPT" --disable-endomorphism --disable-openssl-tests --enable-benchmark --with-bignum=no --with-asm=$ASM && make -j9 bench_verify tests && ./tests 1) >/dev/null && echo -n "$C $OPT asm=$ASM: " && ./bench_verify; done; done; done) | tee -a ~/secp.log
gcc-4.7 -O2 asm=no: ecdsa_verify: min 131us / avg 131us / max 132us
gcc-4.7 -O2 asm=x86_64: ecdsa_verify: min 107us / avg 107us / max 107us
gcc-4.7 -O3 asm=no: ecdsa_verify: min 133us / avg 134us / max 134us
gcc-4.7 -O3 asm=x86_64: ecdsa_verify: min 106us / avg 106us / max 106us
gcc-4.9 -O2 asm=no: ecdsa_verify: min 125us / avg 125us / max 126us
gcc-4.9 -O2 asm=x86_64: ecdsa_verify: min 109us / avg 109us / max 109us
gcc-4.9 -O3 asm=no: ecdsa_verify: min 127us / avg 127us / max 127us
gcc-4.9 -O3 asm=x86_64: ecdsa_verify: min 108us / avg 108us / max 108us
gcc-5 -O2 asm=no: ecdsa_verify: min 115us / avg 115us / max 115us
gcc-5 -O2 asm=x86_64: ecdsa_verify: min 108us / avg 108us / max 109us
gcc-5 -O3 asm=no: ecdsa_verify: min 115us / avg 115us / max 116us
gcc-5 -O3 asm=x86_64: ecdsa_verify: min 108us / avg 108us / max 108us
gcc-6 -O2 asm=no: ecdsa_verify: min 113us / avg 113us / max 113us
gcc-6 -O2 asm=x86_64: ecdsa_verify: min 109us / avg 109us / max 109us
gcc-6 -O3 asm=no: ecdsa_verify: min 114us / avg 114us / max 115us
gcc-6 -O3 asm=x86_64: ecdsa_verify: min 110us / avg 110us / max 110us
gcc-7 -O2 asm=no: ecdsa_verify: min 118us / avg 118us / max 118us
gcc-7 -O2 asm=x86_64: ecdsa_verify: min 109us / avg 109us / max 109us
gcc-7 -O3 asm=no: ecdsa_verify: min 123us / avg 123us / max 123us
gcc-7 -O3 asm=x86_64: ecdsa_verify: min 115us / avg 115us / max 115us
gcc-8 -O2 asm=no: ecdsa_verify: min 108us / avg 108us / max 109us
gcc-8 -O2 asm=x86_64: ecdsa_verify: min 108us / avg 108us / max 109us
gcc-8 -O3 asm=no: ecdsa_verify: min 117us / avg 117us / max 118us
gcc-8 -O3 asm=x86_64: ecdsa_verify: min 111us / avg 111us / max 111us
gcc-9 -O2 asm=no: ecdsa_verify: min 106us / avg 106us / max 106us
gcc-9 -O2 asm=x86_64: ecdsa_verify: min 110us / avg 110us / max 110us
gcc-9 -O3 asm=no: ecdsa_verify: min 105us / avg 105us / max 106us
gcc-9 -O3 asm=x86_64: ecdsa_verify: min 111us / avg 111us / max 111us
gcc-10 -O2 asm=no: ecdsa_verify: min 115us / avg 115us / max 115us
gcc-10 -O2 asm=x86_64: ecdsa_verify: min 110us / avg 110us / max 110us
gcc-10 -O3 asm=no: ecdsa_verify: min 105us / avg 105us / max 106us
gcc-10 -O3 asm=x86_64: ecdsa_verify: min 110us / avg 111us / max 111us
clang-3.7 -O2 asm=no: ecdsa_verify: min 122us / avg 122us / max 122us
clang-3.7 -O2 asm=x86_64: ecdsa_verify: min 112us / avg 112us / max 113us
clang-3.7 -O3 asm=no: ecdsa_verify: min 119us / avg 119us / max 120us
clang-3.7 -O3 asm=x86_64: ecdsa_verify: min 112us / avg 112us / max 112us
clang-3.8 -O2 asm=no: ecdsa_verify: min 124us / avg 124us / max 124us
clang-3.8 -O2 asm=x86_64: ecdsa_verify: min 113us / avg 113us / max 114us
clang-3.8 -O3 asm=no: ecdsa_verify: min 119us / avg 119us / max 120us
clang-3.8 -O3 asm=x86_64: ecdsa_verify: min 113us / avg 113us / max 114us
clang-4.0 -O2 asm=no: ecdsa_verify: min 119us / avg 119us / max 120us
clang-4.0 -O2 asm=x86_64: ecdsa_verify: min 114us / avg 114us / max 114us
clang-4.0 -O3 asm=no: ecdsa_verify: min 118us / avg 118us / max 118us
clang-4.0 -O3 asm=x86_64: ecdsa_verify: min 114us / avg 114us / max 114us
clang-6.0 -O2 asm=no: ecdsa_verify: min 115us / avg 115us / max 116us
clang-6.0 -O2 asm=x86_64: ecdsa_verify: min 114us / avg 114us / max 114us
clang-6.0 -O3 asm=no: ecdsa_verify: min 112us / avg 112us / max 113us
clang-6.0 -O3 asm=x86_64: ecdsa_verify: min 114us / avg 114us / max 114us
clang-7 -O2 asm=no: ecdsa_verify: min 115us / avg 115us / max 116us
clang-7 -O2 asm=x86_64: ecdsa_verify: min 113us / avg 113us / max 113us
clang-7 -O3 asm=no: ecdsa_verify: min 112us / avg 112us / max 113us
clang-7 -O3 asm=x86_64: ecdsa_verify: min 113us / avg 113us / max 113us
clang-8 -O2 asm=no: ecdsa_verify: min 115us / avg 115us / max 116us
clang-8 -O2 asm=x86_64: ecdsa_verify: min 113us / avg 113us / max 114us
clang-8 -O3 asm=no: ecdsa_verify: min 112us / avg 112us / max 112us
clang-8 -O3 asm=x86_64: ecdsa_verify: min 113us / avg 113us / max 114us
clang-9 -O2 asm=no: ecdsa_verify: min 114us / avg 114us / max 115us
clang-9 -O2 asm=x86_64: ecdsa_verify: min 113us / avg 113us / max 113us
clang-9 -O3 asm=no: ecdsa_verify: min 111us / avg 111us / max 112us
clang-9 -O3 asm=x86_64: ecdsa_verify: min 113us / avg 113us / max 113us
clang-10 -O2 asm=no: ecdsa_verify: min 114us / avg 115us / max 115us
clang-10 -O2 asm=x86_64: ecdsa_verify: min 113us / avg 113us / max 114us
clang-10 -O3 asm=no: ecdsa_verify: min 111us / avg 111us / max 112us
clang-10 -O3 asm=x86_64: ecdsa_verify: min 113us / avg 113us / max 114us

@sipa
Contributor

sipa commented May 28, 2020

So eh... should we look at the GCC 9.3.0+ -O3 output, and try to convert that to inline assembly?

@real-or-random
Contributor

Oof. Okay, so the clang performance is not really dependent on the version, so it's more interesting to look at GCC.

I think then that we should keep the asm for now, given that many people/distros don't have GCC >= 9.

It would be interesting to see what causes the difference in -O2 for GCC 9 vs 10. We switched to -O2 because there were no large differences, and we figured out that -O2 is in fact quicker on GCC (see #700 (comment) for GCC 8). For example, if they moved some flags from -O2 to -O3 in GCC 10, maybe we want to enable them again?

So eh... should we look at the GCC 9.3.0+ -O3 output, and try to convert that to inline assembly?

PR welcome. 😛

By the way, the green line is missing from the clang graph.

@jonasnick
Contributor Author

jonasnick commented May 29, 2020

Nice data @sipa. So our current default options are the best choices for <= gcc-8.

Furthermore, now that we've switched to -O2 by default: with gcc-10 at -O2, scalar assembly alone is not faster than scalar and field assembly together. As in the OP (gcc-9), with gcc-10 at -O3, scalar assembly alone is the fastest configuration.

gcc10 -O3:
ecdsa_verify: avg 58.8us # with assembly
ecdsa_verify: avg 53.2us # without assembly
ecdsa_verify: avg 52.8us # with scalar assembly only

gcc10 -O2:
ecdsa_verify: avg 57.6us # with assembly
ecdsa_verify: avg 59.3us # without assembly
ecdsa_verify: avg 58.5us # with scalar assembly only

For example, if they moved some flags from -O2 to -O3 in GCC 10, maybe we want to enable them again?

No idea what changed in gcc-10, but looking at the slopes above, there is some reason to believe that -O2 will get faster again with gcc-11 😬

@sipa
Contributor

sipa commented May 29, 2020

By the way, the green line is missing from the clang graph.

No, it's actually just hidden by the orange line. For all versions of clang, -O2/asm seems indistinguishable from -O3/asm.

@elichai
Contributor

elichai commented Jun 4, 2020

FWIW, when compiling to asm, field_5x52_int128_impl.h produces the same ASM for both -O2 and -O3:

$ sed -i "11s/^/#include \"..\/include\/secp256k1.h\" \n#include \"util.h\"\n /" src/field_5x52_int128_impl.h
$ sed -i 's/static//g' src/field_5x52_int128_impl.h
$ gcc -xc -std=c89 -S -O2 -DHAVE___INT128 src/field_5x52_int128_impl.h -o o2.s
$ gcc -xc -std=c89 -S -O3 -DHAVE___INT128 src/field_5x52_int128_impl.h -o o3.s
$ diff o2.s o3.s

If we do actually believe that gcc-10 produces better asm, we can easily use the commands above and then just link against these functions regardless of the gcc version.

@real-or-random
Contributor

real-or-random commented Sep 24, 2023

I think by now, we should just turn off asm by default. Compiler-generated code has been faster for a few years, and so it should be better for most users. (related: bitcoin/bitcoin#27897)

Turning it off by default is real progress, and then we can keep procrastinating on whether we want to remove the existing ASM, replace it by CryptOpt, or whatever.

Recent (unscientific) benchmark on my laptop:

Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)

clang 16.0.6:
schnorrsig_sign               ,    30.0       ,    30.0       ,    30.2    
schnorrsig_verify             ,    57.3       ,    57.4       ,    57.4

clang, no asm:
schnorrsig_sign               ,    28.7       ,    28.7       ,    28.8    
schnorrsig_verify             ,    52.4       ,    52.5       ,    52.6

gcc version 13.2.1, asm:
schnorrsig_sign               ,    30.9       ,    31.2       ,    33.3    
schnorrsig_verify             ,    59.0       ,    59.1       ,    59.1    

gcc, no asm:
schnorrsig_sign               ,    29.1       ,    29.1       ,    29.1    
schnorrsig_verify             ,    54.2       ,    54.2       ,    54.3

@real-or-random
Contributor

@theStack Are you willing to run some detailed benchmarks here with recent compilers?

@fanquake
Member

fanquake commented Oct 2, 2023

cc @jamesob (might also be interested in benching)

@real-or-random
Contributor

We've just discussed this on IRC (log). A short summary is:

  • A good way to come to a decision is to run a benchmark on the GCC used for the official builds of Bitcoin Core. Currently, this is GCC 10.5.0. So if GCC 10.5.0 is faster than our ASM, we should do something now. Otherwise, we should wait for bitcoin/bitcoin#27897 (guix: use GCC 12.3.0 to build releases), and then recheck with GCC 12.3.

  • We can't just change the default of the ASM build setting to "off", because this setting also affects the scalar ASM. The scalar ASM is probably still better because its performance relies on ADC instructions, and compilers are still bad at generating such carry chains (a rough sketch of the pattern follows below). (But the truth is, clang got better here, GCC 14 will also become better, and we should do benchmarks.) We could, however, replace the current field ASM with compiler-generated ASM, or just remove it. Or we could even introduce separate settings for field and scalar ASM.
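
(For illustration, the kind of carry chain meant above: the scalar asm expresses multi-word additions directly as ADD/ADC, while a compiler has to recognize the pattern in portable C. The sketch below is generic and not the library's scalar code.)

    #include <stdint.h>

    /* Sketch: 4x64-bit addition with carry propagation. Hand-written
       scalar asm writes this as one ADD followed by three ADCs; how well
       a compiler maps the portable C below onto that chain varies by
       compiler and version. */
    static uint64_t add_4x64_sketch(uint64_t r[4], const uint64_t a[4], const uint64_t b[4]) {
        uint64_t carry = 0;
        int i;
        for (i = 0; i < 4; i++) {
            uint64_t t = a[i] + carry;
            uint64_t c1 = (t < carry);   /* overflow from adding the incoming carry */
            r[i] = t + b[i];
            carry = c1 + (r[i] < b[i]);  /* overflow from adding b[i] */
        }
        return carry;
    }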

In any case, the next step here is to benchmark GCC 10.5.0 vs. our field ASM, both on the macro level (ECDSA/Schnorr verify) and the micro level (field benchmarks).

@theStack
Contributor

theStack commented Nov 23, 2023

In any case, the next step here is to benchmark GCC 10.5.0 vs. our field ASM, both on the macro level (ECDSA/Schnorr verify) and the micro level (field benchmarks).

I've run some benchmarks with GCC 10.5.0 on master (commit e721039), and the verification routines for both ECDSA and Schnorr signatures are significantly faster without the ASM optimizations (>10%). As assumed in the recent IRC discussion, though, the scalar routines still perform better with ASM:

$ gcc-10 --version | head -1
gcc-10 (Ubuntu 10.5.0-1ubuntu1~22.04) 10.5.0

$ ./configure --with-asm=x86_64 CC=gcc-10 && make clean && make && ./bench verify sign && ./bench_internal mul split sqrt | tail -n +2
Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    

ecdsa_verify                  ,    33.0       ,    33.5       ,    34.2    
ecdsa_sign                    ,    25.4       ,    25.9       ,    27.1    
schnorrsig_sign               ,    19.1       ,    19.3       ,    19.5    
schnorrsig_verify             ,    33.6       ,    34.0       ,    34.5    

scalar_mul                    ,     0.0289    ,     0.0305    ,     0.0332 
scalar_split                  ,     0.120     ,     0.125     ,     0.133  
field_mul                     ,     0.0173    ,     0.0179    ,     0.0189 
field_sqrt                    ,     4.27      ,     4.37      ,     4.62   

$ ./configure --with-asm=no CC=gcc-10 && make clean && make && ./bench verify sign && ./bench_internal mul split sqrt | tail -n +2
Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    

ecdsa_verify                  ,    29.7       ,    30.5       ,    31.1    
ecdsa_sign                    ,    23.8       ,    24.3       ,    24.9    
schnorrsig_sign               ,    17.5       ,    17.9       ,    18.4    
schnorrsig_verify             ,    30.1       ,    30.4       ,    31.3    

scalar_mul                    ,     0.0339    ,     0.0393    ,     0.0616 
scalar_split                  ,     0.141     ,     0.155     ,     0.192  
field_mul                     ,     0.0133    ,     0.0138    ,     0.0151 
field_sqrt                    ,     3.84      ,     3.95      ,     4.22   

So it seems, provided that others can confirm these results, that at least the field ASM could/should be removed (or replaced by compiler-generated ASM)? I'm also happy to run more benchmarks with newer versions of gcc and clang.

@real-or-random
Contributor

Thanks, just a quick remark: Can you also check the sign functions to see if the difference in scalar performance actually makes a measurable difference for signing?

@theStack
Contributor

theStack commented Nov 23, 2023

Thanks, just a quick remark: Can you also check the sign functions to see if the difference in scalar performance actually makes a measurable difference for signing?

Sure, good point. I've re-run my benchmarks (this time with the "sign" parameter added to the bench call) and updated the results above accordingly. It seems that the sign operations for both ECDSA and Schnorr are also consistently faster on my machine without ASM on GCC 10.5.0.

real-or-random added a commit to real-or-random/secp256k1 that referenced this issue Nov 24, 2023
Widely available versions of GCC and Clang beat our field asm on -O2.
In particular, GCC 10.5.0, which is Bitcoin Core's current compiler
for official x86_64 builds, produces code that is > 20% faster for
fe_mul and > 10% faster for signature verification (see bitcoin-core#726).

These are the alternatives to this PR:

We could replace our current asm with the fastest compiler output
that we can find. This is potentially faster, but it has multiple
drawbacks:
 - It's more coding work because it needs detailed benchmarks (e.g.,
   with many compilers/options).
 - It's more review work because we need to deal with inline asm
   (including clobbers etc.) and there's a lack of expert reviewers
   in this area.
 - It's not unlikely that we'll fall behind again in a few compiler
   versions, and then we have to deal with this again, i.e., redo the
   benchmarks. Given our history here, I doubt that we'll resolve
   this in a timely manner.

We could change the default of the asm build option to off. But this
will also disable the scalar asm, which is still faster.

We could split the build option into two separate options for field
and scalar asm and only disable the field asm by default. But this
adds complexity to the build and to the test matrix.

My conclusion is that this PR gets the low-hanging fruit in terms of
performance. It simplifies our code significantly. It's clearly an
improvement, and it's very easy to review. Whether re-introducing
better asm (whether from a compiler or from CryptOpt) is worth the
hassle can be evaluated separately, and should not hold up this
improvement.

Solves bitcoin-core#726.
@real-or-random
Contributor

nit: I think you wanted to run the benchmark for field_sqr (instead of field_sqrt) because this is what we have assembly for. But no need to re-run from my side. sqr is almost like mul, and anyway, what counts in the end is verification...

real-or-random added a commit that referenced this issue Nov 24, 2023
1ddd76a bench: add --help option to bench_internal (Sebastian Falbesoner)

Pull request description:

  While coming up with commands for running the benchmarks for issue #726 (comment), I noticed that in contrast to `bench{_ecmult}`, `bench_internal` doesn't have a help option yet and figured it would be nice to have one. A comparable past PR is #1008. Benchmark categories appear in the same order as they are executed; the concrete benchmark names in parentheses per category are listed in alphabetical order.

ACKs for top commit:
  real-or-random:
    utACK 1ddd76a
  siv2r:
    ACK 1ddd76a, tested the `--help` option locally, and it works as expected.

Tree-SHA512: d117641a5f25a7cbf83881f3acceae99624528a0cbb2405efdbe1a3a2762b4d6b251392e954aaa32f6771069d31143743770fccafe198084c12258dedb0856fc
@theStack
Contributor

nit: I think you wanted to run the benchmark for field_sqr (instead of field_sqrt) because this is what we have assembly for. But no need to re-run from my side. sqr is almost like mul, and anyway, what counts in the end is verification...

Ah, that was indeed unintended. Fixed that for the two other benchmarks I ran in the course of reviewing #1446 (#1446 (review)).

@real-or-random
Contributor

We have removed the field assembly in #1446.
