Don't zero-after-free DataStream: Faster IBD on some configurations #30987
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Don't zero-after-free DataStream
: Faster IBD on some configurations
#30987
Conversation
🚧 At least one of the CI tasks failed.
Force-pushed from d90eb00 to 5cf2fef
Appendix
The CI failure on ARM is related, and I am able to reproduce it locally.
I tested this PR on an AMD Ryzen 7950x machine with Ubuntu 24.04, with one local network peer and a gigabit internet connection.

Before: 5 hours 10 minutes. Time includes 20 minutes to flush chainstate to disk during shutdown. That's nowhere near a 25% difference and probably not statistically significant. Some configurations might do better than others from this change; it's certainly not worse. The node was additionally patched to drop the
Title changed from "DataStream: ~25% faster IBD" to "DataStream: ~25% faster IBD on some configurations"
@davidgumberg Thanks for the very detailed description! I suspect the speedup here is going to be very dependent on the architecture/environment, but it's not clear to me exactly what variables would matter most. Would you be interested in working up a specific zero-after-free benchmark that illustrates the difference? If we knew exactly where the speedup was coming from, we could make a more informed decision about what/where to optimize.
@Sjors Thanks for testing and showing that the benefit (if any) here is setup-dependent. I suspect the biggest improvements will be seen on memory-bandwidth-constrained systems and when flushing/syncing the coins db to disk, most of which you have skipped by running with such a high dbcache and unpruned; so I think your setup is a worst-case scenario for improvements from this change, but that could be my own wishful thinking!

@theuni Good point, especially given Sjors' result; I will draft up a zero-after-free benchmark to isolate the causes and relevant factors of any performance benefit that might be here.
Title changed from "DataStream: ~25% faster IBD on some configurations" to "DataStream: Faster IBD on some configurations"
Force-pushed from 10a5836 to d12a149
Force-pushed from d12a149 to 1e6cb11
Force-pushed from d146017 to 3c5d8d4
Force-pushed from 3c5d8d4 to 0a3bae7
The removed passage, introduced with this benchmark in bitcoin#16267, appears to have been copied from the earlier block tests in `bench/checkblock.cpp` (bitcoin#9049). There, it is relevant to prevent triggering what seems to be a vestigial branch of DataStream::Rewind() related to the unused DataStream::Compact(). While harmless, it is removed because it can trigger a spurious bounds warning on GCC <12.3 and <11.4. This issue was previously worked around in c78d8ff (PR #30765). GCC Bugzilla issue: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100366
Avoid using BOOST_CHECK_EQUAL_COLLECTIONS for two std::vector<std::pair<SerializeData, SerializeData>>, since Boost expects printing methods that are very brittle. See:
* https://www.boost.org/doc/libs/1_86_0/libs/test/doc/html/boost_test/test_output/test_tools_support_for_logging/testing_tool_output_disable.html
* https://stackoverflow.com/questions/10976130/boost-check-equal-with-pairint-int-and-custom-operator
* https://stackoverflow.com/questions/3999644/how-to-compare-vectors-with-boost-test
* boostorg/type_traits#196
`SerializeData` is used by `DataStream`, which is used throughout the codebase for non-secret data. The zeroing allocator was originally introduced as a mitigation for buffer overflows, but the potential mitigation is not worth the performance cost: it slows down IBD by as much as ~25%.
Force-pushed from 0a3bae7 to 906e67b
Fixed the spurious array bounds warning that occurs on Debian because it uses GCC 12.2, which has a bug that triggers spurious array-bounds warnings (GCC Bugzilla #100366).

As suggested by @theuni, I've added some benchmarks that help show where the performance improvement is coming from. In a test where a 1000-byte

I also made a branch (davidgumberg@c832fed) with a version of the zero-after-free allocator that keeps the compiler-optimization prevention but doesn't actually memset the stream to zero; its performance in some cases is only slightly better than master. For example, in the same test as above it managed ~4.72 GB/s on the 7900x, while on the Raspberry Pi this "partial zero-after-free" branch was closer to my no-zeroing branch, getting ~6.95 GB/s. This seems to hint that a large part of the performance issue here isn't just from zeroing memory with `memset`.

I ran the benchmarks on three devices, and the data is below. The most curious result is from the Ryzen 7640U w/ 5600 MT/s memory, which showed the least improvement between master, "partial zero", and my branch. Repeated runs were noisy; I used `-min-time=60000`.
ns/byte | byte/s | err% | ins/byte | cyc/byte | IPC | bra/byte | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
20.67 | 48,369,994.69 | 0.6% | 61.83 | 47.04 | 1.314 | 8.87 | 0.8% | 68.71 | CCoinsViewDBFlush |
0.86 | 1,165,414,983.28 | 0.0% | 5.14 | 2.06 | 2.498 | 1.04 | 0.0% | 66.01 | DataStreamAlloc |
0.18 | 5,416,728,210.85 | 0.1% | 1.26 | 0.44 | 2.839 | 0.25 | 0.0% | 66.00 | DataStreamSerializeScript |
9.06 | 110,322,628.58 | 0.1% | 32.40 | 21.69 | 1.493 | 6.06 | 0.7% | 66.09 | ProcessMessageBlock |
ns/block | block/s | err% | ins/block | cyc/block | IPC | bra/block | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
4,319,764.64 | 231.49 | 0.2% | 21,518,823.54 | 10,347,746.41 | 2.080 | 3,965,023.40 | 1.1% | 64.06 | DeserializeAndCheckBlockTest |
2,983,304.65 | 335.20 | 0.1% | 14,726,319.41 | 7,146,940.53 | 2.061 | 2,622,747.10 | 0.7% | 66.04 | DeserializeBlockTest |
Modified zero-after-free allocator that prevents memory optimization but doesn't zero memory.
~/bitcoin $ git checkout --detach $partzero && cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release &>/dev/null && cmake --build build -j $(nproc) &>/dev/null && ./build/src/bench/bench_bitcoin -filter="(DataStream.*|CCoinsViewDB.*|ProcessMessage.*|Deserial.*)" -min-time=60000
HEAD is now at 0351c4242a Modify zero after free allocator to prevent optimizations without zeroing memory
ns/byte | byte/s | err% | ins/byte | cyc/byte | IPC | bra/byte | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
20.64 | 48,461,301.71 | 0.5% | 61.84 | 46.60 | 1.327 | 8.87 | 0.8% | 68.28 | CCoinsViewDBFlush |
0.84 | 1,183,775,230.65 | 0.0% | 5.08 | 2.03 | 2.505 | 1.02 | 0.0% | 66.02 | DataStreamAlloc |
0.14 | 6,951,563,016.33 | 0.0% | 1.13 | 0.35 | 3.273 | 0.21 | 0.0% | 66.00 | DataStreamSerializeScript |
9.45 | 105,798,798.06 | 0.3% | 46.75 | 22.67 | 2.062 | 8.46 | 0.5% | 66.14 | ProcessMessageBlock |
ns/block | block/s | err% | ins/block | cyc/block | IPC | bra/block | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
4,172,021.00 | 239.69 | 0.1% | 21,543,066.84 | 9,993,817.02 | 2.156 | 3,988,350.18 | 1.0% | 63.92 | DeserializeAndCheckBlockTest |
2,919,977.25 | 342.47 | 0.0% | 14,750,310.48 | 6,994,754.12 | 2.109 | 2,646,087.06 | 0.5% | 66.07 | DeserializeBlockTest |
My PR branch with no zero-after-free allocator:
~/bitcoin $ git checkout --detach $nozero && cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release &>/dev/null && cmake --build build -j $(nproc) &>/dev/null && ./build/src/bench/bench_bitcoin -filter="(DataStream.*|CCoinsViewDB.*|ProcessMessage.*|Deserial.*)" -min-time=60000
HEAD is now at 906e67b951 refactor: Drop unused `zero_after_free_allocator`
ns/byte | byte/s | err% | ins/byte | cyc/byte | IPC | bra/byte | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
20.89 | 47,868,766.30 | 0.7% | 60.74 | 47.24 | 1.286 | 9.12 | 0.9% | 69.52 | CCoinsViewDBFlush |
0.04 | 27,639,502,423.73 | 0.0% | 0.20 | 0.09 | 2.312 | 0.04 | 0.0% | 66.02 | DataStreamAlloc |
0.14 | 7,030,720,015.31 | 0.0% | 1.09 | 0.34 | 3.203 | 0.22 | 0.0% | 66.03 | DataStreamSerializeScript |
8.46 | 118,171,923.30 | 0.1% | 29.40 | 20.25 | 1.452 | 5.06 | 0.8% | 66.06 | ProcessMessageBlock |
ns/block | block/s | err% | ins/block | cyc/block | IPC | bra/block | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
4,111,234.73 | 243.24 | 0.1% | 21,519,664.21 | 9,847,208.26 | 2.185 | 3,965,210.98 | 1.0% | 63.80 | DeserializeAndCheckBlockTest |
2,857,220.97 | 349.99 | 0.1% | 14,727,090.03 | 6,843,201.05 | 2.152 | 2,622,831.00 | 0.5% | 65.95 | DeserializeBlockTest |
Ryzen 7900x 5200 MT/s DDR5
Original zero-after-free allocator still in use with DataStream
~/bitcoin$ git checkout --detach $yeszero && cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release &>/dev/null && cmake --build build -j $(nproc) &>/dev/null && ./build/src/bench/bench_bitcoin -filter="(DataStream.*|CCoinsViewDB.*|ProcessMessage.*|Deserial.*)" -min-time=60000
HEAD is now at 772b1f606f test: avoid BOOST_CHECK_EQUAL for complex types
ns/byte | byte/s | err% | ins/byte | cyc/byte | IPC | bra/byte | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
6.14 | 162,782,032.39 | 0.7% | 56.96 | 27.77 | 2.051 | 8.25 | 0.7% | 61.54 | CCoinsViewDBFlush |
0.19 | 5,280,744,677.81 | 0.1% | 5.10 | 0.89 | 5.755 | 1.02 | 0.0% | 65.93 | DataStreamAlloc |
0.22 | 4,577,202,378.38 | 0.5% | 5.70 | 1.02 | 5.579 | 1.16 | 0.1% | 66.27 | DataStreamSerializeScript |
2.37 | 422,778,468.05 | 0.2% | 32.39 | 11.06 | 2.929 | 5.12 | 0.6% | 66.04 | ProcessMessageBlock |
ns/block | block/s | err% | ins/block | cyc/block | IPC | bra/block | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
1,319,284.06 | 757.99 | 0.4% | 20,617,084.61 | 6,164,538.66 | 3.344 | 3,706,003.42 | 0.7% | 65.82 | DeserializeAndCheckBlockTest |
879,982.73 | 1,136.39 | 0.4% | 14,213,986.82 | 4,113,201.90 | 3.456 | 2,432,431.24 | 0.2% | 65.87 | DeserializeBlockTest |
Modified zero-after-free allocator that prevents memory optimization but doesn't zero memory.
~/btc/bitcoin$ git checkout --detach $partzero && cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release &>/dev/null && cmake --build build -j $(nproc) &>/dev/null && ./build/src/bench/bench_bitcoin -filter="(DataStream.*|CCoinsViewDB.*|ProcessMessage.*|Deserial.*)" -min-time=60000
HEAD is now at 3bdd43680e Modify zero after free allocator to prevent optimizations without zeroing memory
ns/byte | byte/s | err% | ins/byte | cyc/byte | IPC | bra/byte | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
6.24 | 160,226,428.51 | 0.5% | 56.96 | 27.99 | 2.035 | 8.25 | 0.7% | 62.34 | CCoinsViewDBFlush |
0.18 | 5,415,824,062.30 | 0.1% | 5.07 | 0.86 | 5.869 | 1.02 | 0.0% | 65.99 | DataStreamAlloc |
0.21 | 4,715,585,681.78 | 0.1% | 5.62 | 0.99 | 5.664 | 1.14 | 0.1% | 65.93 | DataStreamSerializeScript |
2.36 | 424,307,427.06 | 0.1% | 32.36 | 11.02 | 2.938 | 5.12 | 0.6% | 66.07 | ProcessMessageBlock |
ns/block | block/s | err% | ins/block | cyc/block | IPC | bra/block | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
1,304,195.07 | 766.76 | 0.1% | 20,615,353.83 | 6,096,229.68 | 3.382 | 3,705,797.43 | 0.7% | 66.01 | DeserializeAndCheckBlockTest |
876,218.51 | 1,141.27 | 0.0% | 14,212,309.42 | 4,095,993.88 | 3.470 | 2,431,660.20 | 0.2% | 65.98 | DeserializeBlockTest |
My PR branch with no zero-after-free allocator:
~/btc/bitcoin$ git checkout --detach $nozero && cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release &>/dev/null && cmake --build build -j $(nproc) &>/dev/null && ./build/src/bench/bench_bitcoin -filter="(DataStream.*|CCoinsViewDB.*|ProcessMessage.*|Deserial.*)" -min-time=60000
HEAD is now at 906e67b951 refactor: Drop unused `zero_after_free_allocator`
ns/byte | byte/s | err% | ins/byte | cyc/byte | IPC | bra/byte | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
6.24 | 160,367,026.40 | 0.8% | 57.70 | 28.03 | 2.059 | 8.46 | 0.7% | 62.47 | CCoinsViewDBFlush |
0.01 | 113,328,653,394.82 | 0.0% | 0.12 | 0.04 | 2.854 | 0.02 | 0.0% | 65.69 | DataStreamAlloc |
0.04 | 23,329,286,239.78 | 0.0% | 0.89 | 0.20 | 4.454 | 0.19 | 0.0% | 64.00 | DataStreamSerializeScript |
2.26 | 441,734,425.78 | 0.1% | 29.88 | 10.58 | 2.825 | 4.62 | 0.6% | 65.89 | ProcessMessageBlock |
ns/block | block/s | err% | ins/block | cyc/block | IPC | bra/block | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
1,302,825.68 | 767.56 | 0.2% | 20,617,190.29 | 6,090,178.32 | 3.385 | 3,706,032.36 | 0.7% | 65.93 | DeserializeAndCheckBlockTest |
874,097.45 | 1,144.04 | 0.1% | 14,212,631.31 | 4,085,149.78 | 3.479 | 2,431,804.86 | 0.2% | 66.24 | DeserializeBlockTest |
Ryzen 5 7640U 5600 MT/s DDR5
This run was done with a slightly updated version of CCoinsViewDBFlush from the above runs.
Original zero-after-free allocator still in use with DataStream
~/bitcoin$ git checkout --detach $yeszero && cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release &>/dev/null && cmake --build build -j $(nproc) &>/dev/null && ./build/src/bench/bench_bitcoin -filter="(DataStream.*|CCoinsViewDB.*|ProcessMessage.*|Deserial.*)" -min-time=60000
HEAD is now at 970e7822d4 test: avoid BOOST_CHECK_EQUAL for complex types
ns/coin | coin/s | err% | ins/coin | cyc/coin | IPC | bra/coin | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
1,537.95 | 650,216.26 | 1.0% | 8,199.37 | 5,151.04 | 1.592 | 1,377.36 | 0.8% | 67.93 | CCoinsViewDBFlush |
ns/byte | byte/s | err% | ins/byte | cyc/byte | IPC | bra/byte | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
0.02 | 44,829,302,792.99 | 0.2% | 0.16 | 0.08 | 2.009 | 0.03 | 0.0% | 65.87 | DataStreamAlloc |
0.08 | 12,714,010,775.32 | 0.1% | 1.25 | 0.27 | 4.579 | 0.24 | 0.0% | 66.38 | DataStreamSerializeScript |
3.74 | 267,485,270.13 | 0.6% | 31.97 | 12.92 | 2.474 | 5.01 | 0.6% | 63.25 | ProcessMessageBlock |
ns/block | block/s | err% | ins/block | cyc/block | IPC | bra/block | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
2,157,061.67 | 463.59 | 0.8% | 21,937,911.43 | 7,472,463.20 | 2.936 | 3,976,431.11 | 0.7% | 66.21 | DeserializeAndCheckBlockTest |
1,523,202.16 | 656.51 | 0.5% | 16,402,554.71 | 5,276,345.85 | 3.109 | 2,930,545.76 | 0.2% | 66.23 | DeserializeBlockTest |
Modified zero-after-free allocator that prevents memory optimization but doesn't zero memory.
~/bitcoin$ git checkout --detach $partzero && cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release &>/dev/null && cmake --build build -j $(nproc) &>/dev/null && ./build/src/bench/bench_bitcoin -filter="(DataStream.*|CCoinsViewDB.*|ProcessMessage.*|Deserial.*)" -min-time=60000
HEAD is now at c832feda63 Modify zero after free allocator to prevent optimizations without zeroing memory
ns/coin | coin/s | err% | ins/coin | cyc/coin | IPC | bra/coin | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
1,558.58 | 641,609.73 | 0.4% | 8,200.70 | 5,210.43 | 1.574 | 1,377.66 | 0.8% | 69.12 | CCoinsViewDBFlush |
ns/byte | byte/s | err% | ins/byte | cyc/byte | IPC | bra/byte | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
0.01 | 71,945,612,656.56 | 0.1% | 0.12 | 0.05 | 2.567 | 0.02 | 0.0% | 65.35 | DataStreamAlloc |
0.06 | 17,044,987,379.75 | 0.3% | 1.16 | 0.20 | 5.685 | 0.21 | 0.0% | 66.07 | DataStreamSerializeScript |
3.67 | 272,659,024.97 | 0.3% | 31.94 | 12.71 | 2.514 | 5.01 | 0.6% | 63.38 | ProcessMessageBlock |
ns/block | block/s | err% | ins/block | cyc/block | IPC | bra/block | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
2,131,937.90 | 469.06 | 0.2% | 21,935,850.89 | 7,404,122.77 | 2.963 | 3,975,866.20 | 0.7% | 66.04 | DeserializeAndCheckBlockTest |
1,516,657.38 | 659.34 | 0.3% | 16,397,963.21 | 5,259,062.71 | 3.118 | 2,929,264.57 | 0.2% | 66.02 | DeserializeBlockTest |
My PR branch with no zero-after-free allocator:
~/bitcoin$ git checkout --detach $nozero && cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release &>/dev/null && cmake --build build -j $(nproc) &>/dev/null && ./build/src/bench/bench_bitcoin -filter="(DataStream.*|CCoinsViewDB.*|ProcessMessage.*|Deserial.*)" -min-time=60000
HEAD is now at b5fee2fd09 refactor: Drop unused `zero_after_free_allocator`
ns/coin | coin/s | err% | ins/coin | cyc/coin | IPC | bra/coin | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
1,504.51 | 664,666.35 | 0.9% | 7,902.94 | 5,023.82 | 1.573 | 1,342.60 | 0.8% | 66.38 | CCoinsViewDBFlush |
ns/byte | byte/s | err% | ins/byte | cyc/byte | IPC | bra/byte | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
0.01 | 75,642,383,126.35 | 0.3% | 0.12 | 0.05 | 2.695 | 0.02 | 0.0% | 65.00 | DataStreamAlloc |
0.05 | 18,357,256,774.43 | 0.2% | 0.91 | 0.19 | 4.800 | 0.19 | 0.0% | 66.02 | DataStreamSerializeScript |
3.65 | 273,613,395.28 | 0.0% | 31.94 | 12.70 | 2.515 | 5.01 | 0.6% | 63.30 | ProcessMessageBlock |
ns/block | block/s | err% | ins/block | cyc/block | IPC | bra/block | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
2,127,133.96 | 470.12 | 0.4% | 21,940,668.03 | 7,392,562.32 | 2.968 | 3,976,863.37 | 0.7% | 66.20 | DeserializeAndCheckBlockTest |
1,512,064.24 | 661.35 | 0.4% | 16,399,530.70 | 5,255,082.56 | 3.121 | 2,929,551.27 | 0.2% | 65.85 | DeserializeBlockTest |
Footnotes
1. I'm not very knowledgeable about memory performance, but I suspect the Ryzen 7640U device's memory is faster for reasons beyond the "max bandwidth" (5200 MT/s for the 7900x, 5600 MT/s for the 7640U). I don't have a reference for this, but I believe the 7640U has a one-generation-newer memory controller than the Ryzen 7900x, and anecdotally I see better performance doing LLM inference on the CPU of the 7640U, which is a memory-bandwidth-bound workload. ↩
This is an interesting (and expected, I suppose) takeaway. Sadly, it suggests that there's really nothing we can do to optimize our implementation. I looked around at clang/gcc/glibc/llvm-libc to see if there are any other ways of handling this, but they all resort to the same memory barrier trick. This is definitely interesting enough to consider disabling selectively, though I'm not convinced we should just nuke it everywhere. I know you've already collected a good bit of data on this, but it's still not clear to me exactly why this is speeding up IBD. It could be, for example, that the net messages account for 90% of the memory_cleanse() calls. Would you be up for creating a callgraph/flamegraph which shows the hierarchy for these calls?
I have compared the full IBD speed of
The added test served as a baseline; dropping the zero-after-free allocator is the purpose of this PR (I know you've added new commits since, let me know if you think that changes the landscape). I've used a low but reasonable 2GB dbcache for the first 800k blocks to measure the underlying database instead of a single final dump with real nodes (which is surprisingly stable, given enough blocks). I ran it on a Hetzner HDD; the results are unfortunately not very promising.

benchmark:

hyperfine --runs 1 \
  --export-json /mnt/ibd_full-zero_after_free_allocator_change.json \
  --parameter-list COMMIT 0449a22bc0bcd610a898ec921af30175e2b34757,5cf2fefd33907c48f79e444032d79f7c889345d8 \
  --prepare 'git checkout {COMMIT} && git clean -fxd && git reset --hard && cmake -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_UTIL=OFF -DBUILD_TX=OFF -DBUILD_TESTS=OFF -DENABLE_WALLET=OFF -DINSTALL_MAN=OFF && cmake --build build -j$(nproc) && rm -rf /mnt/BitcoinData/*' \
  'COMMIT={COMMIT} ./build/src/bitcoind -datadir=/mnt/BitcoinData -stopatheight=800000 -dbcache=2000 -printtoconsole=0'

Benchmark 1: COMMIT=0449a22bc0bcd610a898ec921af30175e2b34757 ./build/src/bitcoind -datadir=/mnt/BitcoinData -stopatheight=800000 -dbcache=2000 -printtoconsole=0
Time (abs ≡): 33806.099 s [User: 27353.365 s, System: 4255.723 s]
Benchmark 2: COMMIT=5cf2fefd33907c48f79e444032d79f7c889345d8 ./build/src/bitcoind -datadir=/mnt/BitcoinData -stopatheight=800000 -dbcache=2000 -printtoconsole=0
Time (abs ≡): 33978.406 s [User: 27050.874 s, System: 4283.780 s]
Summary
COMMIT=0449a22bc0bcd610a898ec921af30175e2b34757 ./build/src/bitcoind -datadir=/mnt/BitcoinData -stopatheight=800000 -dbcache=2000 -printtoconsole=0 ran
1.01 times faster than COMMIT=5cf2fefd33907c48f79e444032d79f7c889345d8 ./build/src/bitcoind -datadir=/mnt/BitcoinData -stopatheight=800000 -dbcache=2000 -printtoconsole=0

So basically this change didn't seem to affect IBD speed at all.
I reran it on the same platform until 840000 with 1GB dbcache with the latest commits.

benchmark:

hyperfine \
  --runs 1 \
  --export-json /mnt/my_storage/ibd_full-zero_after_free_allocator_change.json \
  --parameter-list COMMIT 1e096b30da808d6b0691f5cde14c529e1c85ff1b,906e67b95157fd557438c37b3085cf5dec2ae135 \
  --prepare 'git checkout {COMMIT} && git clean -fxd && git reset --hard && cmake -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_UTIL=OFF -DBUILD_TX=OFF -DBUILD_TESTS=OFF -DENABLE_WALLET=OFF -DINSTALL_MAN=OFF && cmake --build build -j$(nproc) && rm -rf /mnt/my_storage/BitcoinData/*' \
  'COMMIT={COMMIT} ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=840000 -dbcache=1000 -printtoconsole=0'

I've compared these two:
results:

Benchmark 1: COMMIT=1e096b30da808d6b0691f5cde14c529e1c85ff1b ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=840000 -dbcache=1000 -printtoconsole=0
Time (abs ≡): 41489.685 s [User: 35294.609 s, System: 6578.215 s]
Benchmark 2: COMMIT=906e67b95157fd557438c37b3085cf5dec2ae135 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=840000 -dbcache=1000 -printtoconsole=0
Time (abs ≡): 42348.064 s [User: 35077.894 s, System: 7019.189 s]
Summary
COMMIT=1e096b30da808d6b0691f5cde14c529e1c85ff1b ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=840000 -dbcache=1000 -printtoconsole=0 ran
1.02 times faster than COMMIT=906e67b95157fd557438c37b3085cf5dec2ae135 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=840000 -dbcache=1000 -printtoconsole=0

Which seems to indicate that there wasn't any speedup after this change.
Concept ACK. Using zero-after-free allocators for all CDataStream usage is overkill. I don't think there are any, but if there are places where it is used (wallet) where security against key leaks is important, we could parametrize it.

I expect this can make a significant difference on systems with relatively slow CPU or memory, like ARM systems (will run some benchmarks).
Exactly. We have

Edit: Benchmarked on an Rpi5 with an NVMe hat, synchronizing from a node directly connected over a 1Gbit network:
>>> (datetime.datetime.fromisoformat('2024-10-16T18:56:56') - datetime.datetime.fromisoformat('2024-10-16T08:14:13')).seconds
38563
>>> (datetime.datetime.fromisoformat('2024-10-15T21:16:02') - datetime.datetime.fromisoformat('2024-10-15T10:39:43')).seconds
38179

It is faster, but only by roughly 1%.
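One caveat about the snippet above: `timedelta.seconds` only reports the sub-day remainder, so it happens to be correct here only because both syncs took under 24 hours. `total_seconds()` is the safe choice (the helper name below is illustrative):

```python
import datetime

def sync_seconds(start_iso: str, end_iso: str) -> int:
    """Duration between two ISO timestamps, in whole seconds.

    Uses total_seconds() so durations longer than a day are not silently
    truncated (timedelta.seconds drops whole days).
    """
    start = datetime.datetime.fromisoformat(start_iso)
    end = datetime.datetime.fromisoformat(end_iso)
    return int((end - start).total_seconds())

print(sync_seconds('2024-10-16T08:14:13', '2024-10-16T18:56:56'))  # 38563
print(sync_seconds('2024-10-15T10:39:43', '2024-10-15T21:16:02'))  # 38179
```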
Given that others have been unable to reproduce the results that led me to open this PR, I am moving it to draft until I better understand either what went wrong in my measurements or which setups reproduce the result I got.
🚧 At least one of the CI tasks failed.
I have noticed that in debug builds the zeroing is a big part of the IBD flame graphs, but in release mode it's almost completely eliminated.
While the performance increase likely isn't measurable, it could still make sense to remove the memory zeroing, at least for most serializations: if for no other reason, it's doing useless work, and the behavior is easier to understand without it.
🐙 This pull request conflicts with the target branch and needs rebase.
⌛ There hasn't been much activity lately and the patch still needs rebase. What is the status here?
This PR modifies `DataStream`'s byte vector `vch` to use the default allocator, `std::allocator`, rather than the `zero_after_free_allocator`, which degrades performance greatly. The `zero_after_free_allocator` is identical to the default `std::allocator` except that it zeroes memory using `memory_cleanse()` before deallocating.

This PR also drops the `zero_after_free_allocator`, since it was only used by `DataStream` and `SerializeData`.

In my testing (n=2) on a Raspberry Pi 5 with 4GB of memory, syncing from a fast connection to a stable dedicated node, my branch takes ~74% of the time taken by master¹ to sync to height 815,000; average wall clock time was 35h 58m 40s on this branch and 48h 17m 15s on master. (See the benchmarking appendix.)
I expect most of the performance improvement to come from the use of `DataStream` for all `CDBWrapper` keys and values, and for all P2P messages. I suspect there are other use cases where performance is improved, but I have only tested IBD.

Objects that contain secrets should not be allocated using `zero_after_free_allocator`, since they are liable to get mapped to swap space and written to disk if the user is running low on memory, and I intuit this is a likelier path for an attacker to find cryptographic secrets than scanning unzeroed memory. Secrets should be allocated using `secure_allocator`, which cleanses on deallocation and `mlock()`s the memory reserved for secrets to prevent it from being mapped to swap space.

Are any secrets stored in `DataStream` that will lose security?

I have reviewed every appearance of `DataStream` and `SerializeData` as of 39219fe and have made notes in the appendix below that provide context for each instance where either is used. The only use case I wasn't certain of is PSBTs; I believe these are never secrets, but I am not certain whether there are use cases where PSBTs are worthy of being treated as secrets, where being vigilant about not writing them to disk would be wise.
As I understand it, most of the use of `DataStream` in the wallet code is for reading and writing "crypted" key and value data, which gets decrypted somewhere else, in a `ScriptPubKeyMan` far away from any `DataStream` container; but I could also be wrong about this, or have misunderstood its use elsewhere in the wallet.

Zero-after-free as a buffer overflow mitigation
The `zero_after_free` allocator was added as a buffer overflow mitigation, the idea being that `DataStream`s store a lot of unsecured data that we don't control, like the UTXO set and all P2P messages, and an attacker could fill memory in a predictable way to escalate a buffer overflow into an RCE. (See the Historical Background appendix.)

I agree completely with practicing security in depth, but I don't think this mitigation is worth the performance hit because:

I'm not a security expert, and I had a hard time finding any writing that discusses this particular mitigation strategy of zeroing memory, so I hope someone with more knowledge of memory vulnerabilities can assist.
Other notes

- I've kept `SerializeData` as `std::vector<std::byte>` instead of deleting it and refactoring the spots where it's used in the wallet, to keep the PR small; if others think it would be better to delete it, I would be happy to do it.
- It may not only be the `memory_cleanse` that is causing the performance issue: the trick we do to prevent compilers from optimizing out the `memset` call may also be preventing other optimizations on the `DataStream`s, but I have yet to test this.
- `SerializeData` once it loses its custom allocator.

Footnotes
1. Master at the time of my testing was 6d546336e800. ↩