feat: Use FxHasher in places where we don't need DDoS resistance #2342
base: main
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@            Coverage Diff             @@
##             main    #2342      +/-   ##
==========================================
- Coverage   95.39%   95.38%    -0.02%
==========================================
  Files         115      115
  Lines       36982    36982
  Branches    36982    36982
==========================================
- Hits        35280    35275        -5
- Misses       1696     1701        +5
  Partials        6        6
Failed Interop Tests: QUIC Interop Runner, client vs. server, differences relative to 9354a53.
- neqo-latest as client
- neqo-latest as server

Succeeded Interop Tests: QUIC Interop Runner, client vs. server.
- neqo-latest as client
- neqo-latest as server

Unsupported Interop Tests: QUIC Interop Runner, client vs. server.
- neqo-latest as client
- neqo-latest as server
Benchmark results: performance differences relative to 9354a53.

- 1-conn/1-100mb-resp/mtu-1504 (aka. Download)/client: No change in performance detected. time: [711.23 ms 715.93 ms 720.61 ms]; thrpt: [138.77 MiB/s 139.68 MiB/s 140.60 MiB/s]; change: time: [-0.8047% +0.0411% +0.9404%] (p = 0.93 > 0.05), thrpt: [-0.9317% -0.0411% +0.8113%]
- 1-conn/10_000-parallel-1b-resp/mtu-1504 (aka. RPS)/client: 💚 Performance has improved. time: [337.02 ms 338.56 ms 340.12 ms]; thrpt: [29.401 Kelem/s 29.537 Kelem/s 29.672 Kelem/s]; change: time: [-3.5830% -2.9267% -2.2806%] (p = 0.00 < 0.05), thrpt: [+2.3339% +3.0150% +3.7162%]
- 1-conn/1-1b-resp/mtu-1504 (aka. HPS)/client: No change in performance detected. time: [24.865 ms 25.018 ms 25.177 ms]; thrpt: [39.718 elem/s 39.971 elem/s 40.217 elem/s]; change: time: [-1.1118% -0.2774% +0.5865%] (p = 0.53 > 0.05), thrpt: [-0.5831% +0.2782% +1.1243%]
- 1-conn/1-100mb-req/mtu-1504 (aka. Upload)/client: 💚 Performance has improved. time: [1.7840 s 1.8013 s 1.8194 s]; thrpt: [54.964 MiB/s 55.516 MiB/s 56.055 MiB/s]; change: time: [-8.0295% -6.7456% -5.4968%] (p = 0.00 < 0.05), thrpt: [+5.8165% +7.2336% +8.7306%]
- decode 4096 bytes, mask ff: No change in performance detected. time: [12.002 µs 12.040 µs 12.085 µs]; change: [-0.5094% +0.0666% +0.7108%] (p = 0.83 > 0.05)
- decode 1048576 bytes, mask ff: 💔 Performance has regressed. time: [3.0621 ms 3.0733 ms 3.0850 ms]; change: [+3.7613% +4.1974% +4.6522%] (p = 0.00 < 0.05)
- decode 4096 bytes, mask 7f: No change in performance detected. time: [19.984 µs 20.036 µs 20.092 µs]; change: [-1.0076% -0.4513% +0.0892%] (p = 0.11 > 0.05)
- decode 1048576 bytes, mask 7f: 💔 Performance has regressed. time: [5.4037 ms 5.4154 ms 5.4287 ms]; change: [+12.348% +12.771% +13.178%] (p = 0.00 < 0.05)
- decode 4096 bytes, mask 3f: No change in performance detected. time: [6.3198 µs 6.3509 µs 6.3874 µs]; change: [-0.1521% +0.2710% +0.7006%] (p = 0.23 > 0.05)
- decode 1048576 bytes, mask 3f: 💔 Performance has regressed. time: [2.5082 ms 2.5154 ms 2.5239 ms]; change: [+16.182% +16.693% +17.214%] (p = 0.00 < 0.05)
- 1 streams of 1 bytes/multistream: 💔 Performance has regressed. time: [72.651 µs 73.318 µs 74.450 µs]; change: [+1.4552% +2.5496% +4.5347%] (p = 0.00 < 0.05)
- 1000 streams of 1 bytes/multistream: 💚 Performance has improved. time: [24.541 ms 24.578 ms 24.616 ms]; change: [-3.2874% -3.0945% -2.9085%] (p = 0.00 < 0.05)
- 10000 streams of 1 bytes/multistream: Change within noise threshold. time: [1.6909 s 1.6928 s 1.6948 s]; change: [-0.6298% -0.4783% -0.3287%] (p = 0.00 < 0.05)
- 1 streams of 1000 bytes/multistream: 💔 Performance has regressed. time: [74.085 µs 74.302 µs 74.524 µs]; change: [+1.7700% +2.2518% +2.7152%] (p = 0.00 < 0.05)
- 100 streams of 1000 bytes/multistream: 💚 Performance has improved. time: [3.3076 ms 3.3143 ms 3.3213 ms]; change: [-1.6070% -1.3071% -1.0227%] (p = 0.00 < 0.05)
- 1000 streams of 1000 bytes/multistream: Change within noise threshold. time: [142.70 ms 142.79 ms 142.87 ms]; change: [-0.4131% -0.3376% -0.2601%] (p = 0.00 < 0.05)
- coalesce_acked_from_zero 1+1 entries: No change in performance detected. time: [94.706 ns 95.025 ns 95.354 ns]; change: [-0.8124% -0.2059% +0.3540%] (p = 0.50 > 0.05)
- coalesce_acked_from_zero 3+1 entries: Change within noise threshold. time: [112.48 ns 112.76 ns 113.07 ns]; change: [-0.9082% -0.4647% -0.0236%] (p = 0.05 < 0.05)
- coalesce_acked_from_zero 10+1 entries: No change in performance detected. time: [111.97 ns 112.37 ns 112.86 ns]; change: [-0.9055% -0.1878% +0.8513%] (p = 0.72 > 0.05)
- coalesce_acked_from_zero 1000+1 entries: No change in performance detected. time: [93.209 ns 93.820 ns 94.642 ns]; change: [-1.8678% -0.5898% +0.8762%] (p = 0.41 > 0.05)
- RxStreamOrderer::inbound_frame(): 💚 Performance has improved. time: [115.27 ms 115.34 ms 115.41 ms]; change: [-1.1736% -1.0958% -1.0206%] (p = 0.00 < 0.05)
- SentPackets::take_ranges: No change in performance detected. time: [8.3832 µs 8.6045 µs 8.8119 µs]; change: [+0.1215% +3.3557% +6.9967%] (p = 0.05 > 0.05)
- transfer/pacing-false/varying-seeds: Change within noise threshold. time: [35.740 ms 35.805 ms 35.869 ms]; change: [+0.1117% +0.3951% +0.6609%] (p = 0.01 < 0.05)
- transfer/pacing-true/varying-seeds: Change within noise threshold. time: [35.696 ms 35.753 ms 35.809 ms]; change: [-1.6784% -1.4634% -1.2540%] (p = 0.00 < 0.05)
- transfer/pacing-false/same-seed: Change within noise threshold. time: [35.586 ms 35.644 ms 35.702 ms]; change: [-1.0278% -0.7855% -0.5564%] (p = 0.00 < 0.05)
- transfer/pacing-true/same-seed: Change within noise threshold. time: [35.924 ms 35.970 ms 36.018 ms]; change: [-1.5175% -1.3530% -1.1751%] (p = 0.00 < 0.05)

Client/server transfer results: performance differences relative to 9354a53. Transfer of 33554432 bytes over loopback, 30 runs. All unit-less numbers are in milliseconds.
👍 in general. That said, I would prefer only replacing std::collections::Hash* where it proves beneficial, e.g. not in unit tests.
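The change under discussion is essentially a drop-in swap of the hash function, not of the map API. A minimal sketch of what a replacement at one call site could look like, assuming the rustc-hash crate (the exact aliases used in this PR may differ):

```rust
use rustc_hash::{FxHashMap, FxHashSet};

// FxHashMap<K, V> is essentially HashMap<K, V> with FxHasher as the
// BuildHasher, so the rest of the code keeps using the familiar HashMap API.
fn track_streams() {
    let mut bytes_per_stream: FxHashMap<u64, usize> = FxHashMap::default();
    bytes_per_stream.insert(0, 1200);

    let mut seen: FxHashSet<u64> = FxHashSet::default();
    seen.insert(0);

    // Lookups behave exactly as with the default SipHash-backed map; only
    // the hash function is cheaper (and not DoS-resistant).
    assert_eq!(bytes_per_stream.get(&0), Some(&1200));
    assert!(seen.contains(&0));
}
```

Because only the hasher type parameter changes, restricting the swap to hot, non-adversarial paths (and leaving tests on the default hasher) is a small, local edit per call site.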
I've spotted two places where EnumMap could give us bigger wins.
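A sketch of that idea, assuming the enum-map crate and an illustrative key type (not the actual neqo enums): when the key space is a small, closed enum, an array-backed map removes hashing from the lookup entirely.

```rust
use enum_map::{enum_map, Enum, EnumMap};

// Illustrative key; the real candidates in neqo would be existing enums such
// as packet-number spaces or similar small, closed sets.
#[derive(Enum, Clone, Copy)]
enum Space {
    Initial,
    Handshake,
    ApplicationData,
}

fn count_packets() {
    // EnumMap is a fixed-size array indexed by the enum discriminant, so a
    // lookup is a constant-time array index with no hashing at all.
    let mut packets: EnumMap<Space, u64> = enum_map! { _ => 0 };
    packets[Space::Initial] += 1;
    packets[Space::ApplicationData] += 3;
    assert_eq!(packets[Space::ApplicationData], 3);
}
```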
I think that your performance gains largely derive from the changes to the client and server code. There, the security risk is limited (we're not using this server in real deployments).
Still, you should review the changes for security risk. This hasher could expose us to DoS if the hashed values are controlled by an adversary. I've checked the usage in our server code, which is fine because attackers don't get to control memory allocations (we use pointer values for the hash). Still, that makes me wonder whether we should be using Pin.
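A minimal sketch of the pointer-identity pattern described above, with illustrative names rather than the actual neqo server types:

```rust
use std::{
    hash::{Hash, Hasher},
    rc::Rc,
};

struct Connection; // stand-in for the real per-connection state

// A key that identifies a connection by the address of its heap allocation.
struct ConnectionKey(Rc<Connection>);

impl Hash for ConnectionKey {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // An attacker cannot choose where the allocator places the object,
        // so a fast, non-DoS-resistant hasher is acceptable for this key.
        // The address is only a stable identity while the object stays in
        // this allocation, which is what the Pin question is about.
        Rc::as_ptr(&self.0).hash(state);
    }
}

impl PartialEq for ConnectionKey {
    fn eq(&self, other: &Self) -> bool {
        Rc::ptr_eq(&self.0, &other.0)
    }
}

impl Eq for ConnectionKey {}
```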
@martinthomson thanks for the analysis. My plan is to add some benches first in another PR; I'll add some for the instances you suggest looking into. Even if some of the macro benefits come from speeding up the demo client and server code, it's IMO still worth doing, since eliminating those overheads makes it easier to spot other bottlenecks. About security, I didn't do much of an analysis, but I think the main use of this insecure hasher would be when looking up items (streams, unacked chunks) that, while under the control of an attacker, are also quite limited in what valid values they can take before immediately causing a connection closure.
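As a rough idea of the kind of micro-benchmark that could quantify the hasher difference in isolation, a sketch assuming criterion (this is not the bench later added in #2444):

```rust
use std::collections::HashMap;

use criterion::{black_box, criterion_group, criterion_main, Criterion};
use rustc_hash::FxHashMap;

fn hasher_lookup(c: &mut Criterion) {
    // Same contents in both maps; only the hasher differs.
    let std_map: HashMap<u64, u64> = (0..10_000).map(|i| (i, i)).collect();
    let fx_map: FxHashMap<u64, u64> = (0..10_000).map(|i| (i, i)).collect();

    c.bench_function("lookup/std-siphash", |b| {
        b.iter(|| std_map.get(black_box(&4_242)))
    });
    c.bench_function("lookup/fxhash", |b| {
        b.iter(|| fx_map.get(black_box(&4_242)))
    });
}

criterion_group!(benches, hasher_lookup);
criterion_main!(benches);
```

A micro-benchmark like this only isolates the hash cost; the transfer and RPS benches above are what show whether it matters end to end.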
I definitely agree with the point about removing the overheads from our toy code as much as possible. This seems like a pretty substantial win there, so it's worth doing. I doubt that my …
I think the …
Good point. Though before we introduce the complexity of …
Definitely the right question to be asking. I think that it might be possible to use the first connection ID as a key for this sort of thing, but we don't tend to keep that around today, once we stop using it. Everything else -- as far as I know -- is ephemeral and therefore not suitable.
I'm doing a benchmark in #2444 to quantify the benefits first. (It's not going well; there's a lot of run-to-run variation for some reason.)
That is counterintuitive for me, given that it uses …
I wonder if it's the CPU scheduler and frequency control on my Mac. Bencher seems much more stable.
For what it is worth, here is #2444 on my machine:
I don't see much deviation. Am I running the wrong version @larseggert?
Can you run it again and see if there are changes run to run? That is where I see random improvements or regressions.
Here are two more runs with vanilla #2444. No significant deviations. Note that I am not running your optimizations in this pull request.
Oh good. I think it is core pinning being awkward on macOS then. BTW, I came across https://manuel.bernhardt.io/posts/2023-11-16-core-pinning/ today, and we should change the bencher accordingly.
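A minimal sketch of pinning the benchmark thread from inside the harness, assuming the core_affinity crate (the linked post also covers OS-level tools like taskset; which approach the bencher should take is an open question):

```rust
use core_affinity::{get_core_ids, set_for_current};

fn pin_to_first_core() {
    // Ask the OS which cores we may use and pin the current thread to one,
    // so the benchmark does not migrate between cores mid-run.
    if let Some(core_ids) = get_core_ids() {
        if let Some(core) = core_ids.into_iter().next() {
            // On macOS thread affinity is only a hint, which may be part of
            // the run-to-run variation seen above.
            if !set_for_current(core) {
                eprintln!("could not pin benchmark thread to a core");
            }
        }
    }
}
```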
Force-pushed from 2b00ef1 to 60cb6f2.
I'm redoing this PR in stages, to check if the new bench actually shows any improvements. The first push changes only the existing (non-test) uses of …
Hm. The benches all show small improvements, while the client/server tests all show small regressions...
I think this may be worthwhile. The cargo benches don't consistently show a benefit, but the loopback transfers on the bencher machine are faster, e.g., … without this PR but … with it. (I'll see if I can improve CI so that we also see the differences to main for the table results.)