feat: Basic GSO support #2532
Conversation
This simply collects batches of same-size, same-marked datagrams to the same destination together by copying. In essence, we trade more memory copies for fewer system calls. Let's see if this matters at all.
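To illustrate the idea (with hypothetical names, not the PR's actual types): datagrams that share a size, mark, and destination are copied into one contiguous buffer, which can then be handed to a single GSO-style send call. A minimal sketch:

```rust
use std::net::SocketAddr;

/// The metadata a datagram must match to join the current batch.
/// Hypothetical simplification of what the PR describes (size, mark,
/// destination), not the actual neqo types.
#[derive(Clone, PartialEq, Eq, Debug)]
struct Meta {
    dst: SocketAddr,
    len: usize,
    mark: u8,
}

#[derive(Default)]
struct Batch {
    meta: Option<Meta>,
    buf: Vec<u8>, // concatenated same-size datagrams
}

impl Batch {
    /// Try to add a datagram. Returns `false` if its metadata does not
    /// match the current batch, meaning the caller must flush first.
    fn push(&mut self, dst: SocketAddr, mark: u8, payload: &[u8]) -> bool {
        let meta = Meta { dst, len: payload.len(), mark };
        match &self.meta {
            Some(m) if *m != meta => return false,
            _ => self.meta = Some(meta),
        }
        // The extra copy is the trade-off the PR describes.
        self.buf.extend_from_slice(payload);
        true
    }

    /// Take the batched bytes (one system call's worth) and reset.
    fn flush(&mut self) -> (Option<Meta>, Vec<u8>) {
        (self.meta.take(), std::mem::take(&mut self.buf))
    }
}
```

The point of the trade-off: each `push` costs a memcpy, but N matching datagrams leave the process in one `sendmsg` instead of N.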
All QNS tests are failing. I see this in the logs:
Benchmark results

Performance differences relative to ddb88ac.

- 1-conn/1-100mb-resp/mtu-1504 (aka. Download)/client: 💚 Performance has improved.
  time: [462.05 ms 468.21 ms 474.36 ms] thrpt: [210.81 MiB/s 213.58 MiB/s 216.43 MiB/s]
  change: time: [-36.832% -35.880% -34.981%] (p = 0.00 < 0.05) thrpt: [+53.801% +55.958% +58.307%]
- 1-conn/10_000-parallel-1b-resp/mtu-1504 (aka. RPS)/client: 💔 Performance has regressed.
  time: [390.77 ms 393.70 ms 396.71 ms] thrpt: [25.207 Kelem/s 25.400 Kelem/s 25.591 Kelem/s]
  change: time: [+10.764% +11.736% +12.692%] (p = 0.00 < 0.05) thrpt: [-11.263% -10.504% -9.7176%]
- 1-conn/1-1b-resp/mtu-1504 (aka. HPS)/client: 💔 Performance has regressed.
  time: [25.877 ms 26.049 ms 26.227 ms] thrpt: [38.128 elem/s 38.390 elem/s 38.645 elem/s]
  change: time: [+2.0098% +2.9960% +4.0276%] (p = 0.00 < 0.05) thrpt: [-3.8716% -2.9089% -1.9702%]
- 1-conn/1-100mb-req/mtu-1504 (aka. Upload)/client: 💚 Performance has improved.
  time: [1.5963 s 1.6124 s 1.6281 s] thrpt: [61.423 MiB/s 62.021 MiB/s 62.645 MiB/s]
  change: time: [-13.392% -11.990% -10.582%] (p = 0.00 < 0.05) thrpt: [+11.834% +13.623% +15.463%]
- decode 4096 bytes, mask ff: No change in performance detected.
  time: [12.069 µs 12.096 µs 12.132 µs] change: [-0.4242% -0.0738% +0.2922%] (p = 0.69 > 0.05)
- decode 1048576 bytes, mask ff: No change in performance detected.
  time: [3.0765 ms 3.0854 ms 3.0951 ms] change: [-0.5653% -0.1081% +0.3419%] (p = 0.66 > 0.05)
- decode 4096 bytes, mask 7f: No change in performance detected.
  time: [20.148 µs 20.195 µs 20.250 µs] change: [-0.4185% +0.0984% +0.5604%] (p = 0.71 > 0.05)
- decode 1048576 bytes, mask 7f: No change in performance detected.
  time: [5.2578 ms 5.2710 ms 5.2858 ms] change: [-0.3092% +0.0578% +0.4232%] (p = 0.76 > 0.05)
- decode 4096 bytes, mask 3f: No change in performance detected.
  time: [7.0230 µs 7.0560 µs 7.0945 µs] change: [-0.8166% -0.1451% +0.4459%] (p = 0.66 > 0.05)
- decode 1048576 bytes, mask 3f: No change in performance detected.
  time: [1.7922 ms 1.7979 ms 1.8049 ms] change: [-0.6514% -0.1113% +0.4253%] (p = 0.68 > 0.05)
- 1 streams of 1 bytes/multistream: No change in performance detected.
  time: [72.739 µs 72.948 µs 73.161 µs] change: [-0.4875% -0.0172% +0.4598%] (p = 0.94 > 0.05)
- 1000 streams of 1 bytes/multistream: No change in performance detected.
  time: [24.805 ms 24.843 ms 24.883 ms] change: [-0.1240% +0.0865% +0.3002%] (p = 0.43 > 0.05)
- 10000 streams of 1 bytes/multistream: No change in performance detected.
  time: [1.6888 s 1.6907 s 1.6925 s] change: [-0.0431% +0.1011% +0.2525%] (p = 0.17 > 0.05)
- 1 streams of 1000 bytes/multistream: No change in performance detected.
  time: [74.407 µs 74.664 µs 74.922 µs] change: [-1.8086% -0.3011% +0.6919%] (p = 0.73 > 0.05)
- 100 streams of 1000 bytes/multistream: No change in performance detected.
  time: [3.3915 ms 3.3973 ms 3.4036 ms] change: [-0.2584% +0.0155% +0.2893%] (p = 0.91 > 0.05)
- 1000 streams of 1000 bytes/multistream: No change in performance detected.
  time: [144.57 ms 144.65 ms 144.73 ms] change: [-0.0401% +0.0436% +0.1259%] (p = 0.30 > 0.05)
- coalesce_acked_from_zero 1+1 entries: No change in performance detected.
  time: [94.494 ns 94.751 ns 95.009 ns] change: [-0.4734% +0.0743% +0.7168%] (p = 0.83 > 0.05)
- coalesce_acked_from_zero 3+1 entries: No change in performance detected.
  time: [112.53 ns 112.89 ns 113.30 ns] change: [-0.3333% +0.1069% +0.5137%] (p = 0.63 > 0.05)
- coalesce_acked_from_zero 10+1 entries: No change in performance detected.
  time: [111.97 ns 112.35 ns 112.82 ns] change: [-0.3878% +0.0474% +0.5106%] (p = 0.85 > 0.05)
- coalesce_acked_from_zero 1000+1 entries: No change in performance detected.
  time: [92.816 ns 93.232 ns 93.667 ns] change: [-1.2989% -0.2400% +0.8431%] (p = 0.66 > 0.05)
- RxStreamOrderer::inbound_frame(): Change within noise threshold.
  time: [115.95 ms 116.02 ms 116.08 ms] change: [+0.5935% +0.6622% +0.7327%] (p = 0.00 < 0.05)
- SentPackets::take_ranges: No change in performance detected.
  time: [8.4318 µs 8.6367 µs 8.8225 µs] change: [-2.7836% -0.2740% +2.1613%] (p = 0.82 > 0.05)
- transfer/pacing-false/varying-seeds: No change in performance detected.
  time: [35.878 ms 35.944 ms 36.010 ms] change: [-0.2768% -0.0286% +0.2290%] (p = 0.82 > 0.05)
- transfer/pacing-true/varying-seeds: Change within noise threshold.
  time: [35.884 ms 35.942 ms 35.999 ms] change: [-1.0147% -0.7745% -0.5382%] (p = 0.00 < 0.05)
- transfer/pacing-false/same-seed: Change within noise threshold.
  time: [36.009 ms 36.075 ms 36.141 ms] change: [+0.9178% +1.1673% +1.4154%] (p = 0.00 < 0.05)
- transfer/pacing-true/same-seed: Change within noise threshold.
  time: [36.116 ms 36.166 ms 36.216 ms] change: [+1.6688% +1.8491% +2.0443%] (p = 0.00 < 0.05)

Client/server transfer results

Performance differences relative to ddb88ac. Transfer of 33554432 bytes over loopback, 30 runs. All unit-less numbers are in milliseconds.
Failed Interop Tests

QUIC Interop Runner, client vs. server, differences relative to 9354a53.
- neqo-latest as client
- neqo-latest as server

Succeeded Interop Tests

QUIC Interop Runner, client vs. server.
- neqo-latest as client
- neqo-latest as server

Unsupported Interop Tests

QUIC Interop Runner, client vs. server.
- neqo-latest as client
- neqo-latest as server
09e04c6 to 65cdb23
Early benchmarks look promising. That said, I am not sure whether we will see similar improvements when benchmarking through Firefox with connection latency and a bandwidth limit.

As discussed out-of-band, I would favor a more integrated implementation, moving all batching logic into `neqo-transport::Connection`. `Connection` can be more efficient at batching, since it has access to all known information about the connection and can allocate all batchable datagrams at once. In addition, this would allow a single batching implementation, used by `neqo-client`, `neqo-server`, `mozilla-central/http3server`, and lastly, of course, Firefox.
For others, a past draft of the above-mentioned integrated implementation: f25b0b7
@larseggert what are the next steps? I would suggest applying the same non-integrated optimization to `neqo_glue/src/lib.rs`. You can easily use a custom `neqo-*` version through a `mozilla-central/Cargo.toml` override. We can then test Firefox upload speed either against a local HTTP/3 server, or using Andrew's upload automation (macOS) for more reproducible results, since it uses a real-world connection to Google's infrastructure instead of a localhost setup.
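For reference, such an override could roughly look like the following in the top-level `Cargo.toml`, assuming Cargo's standard `[patch]` mechanism; the paths and the exact set of patched crates are examples, not what mozilla-central actually uses:

```toml
# Hypothetical sketch: point the neqo-* crates at a local checkout.
# Adjust crate names and paths to the actual dependency source.
[patch.crates-io]
neqo-transport = { path = "../neqo/neqo-transport" }
neqo-common = { path = "../neqo/neqo-common" }
neqo-udp = { path = "../neqo/neqo-udp" }
```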
/// When a datagram is pushed that does not match the metadata of the batch,
/// it is stored in `next` and a send indication is returned.
The mechanism of `next` is not intuitive to me. Why doesn't `push` simply return the `Datagram` when it doesn't match?
Because then the different users of the type would need to implement their own method for storing it and switching to it. I thought it would be simpler if the type handled that for the caller.
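A rough sketch of that `next` mechanism as I understand it (a hypothetical simplification, not the PR's actual code): a mismatching datagram is stashed inside the type and a send indication is returned; when the caller takes the batch to send it, the stashed datagram seeds the next batch, so no caller has to carry the leftover itself.

```rust
// Hypothetical simplification of the `next` mechanism under discussion.
// Matching is reduced to the datagram length for brevity.
#[derive(Clone, PartialEq, Eq)]
struct Datagram {
    len: usize, // stands in for the size/mark/destination metadata
    payload: Vec<u8>,
}

enum Pushed {
    Batched,   // datagram joined the current batch
    SendBatch, // caller should send `take_batch()` now
}

#[derive(Default)]
struct Batcher {
    batch: Vec<Datagram>,
    next: Option<Datagram>, // mismatching datagram, held for the next batch
}

impl Batcher {
    fn push(&mut self, d: Datagram) -> Pushed {
        match self.batch.first() {
            Some(first) if first.len != d.len => {
                // Mismatch: stash it and tell the caller to send.
                self.next = Some(d);
                Pushed::SendBatch
            }
            _ => {
                self.batch.push(d);
                Pushed::Batched
            }
        }
    }

    /// Hand out the current batch; the stashed datagram starts the new one.
    fn take_batch(&mut self) -> Vec<Datagram> {
        let sent = std::mem::take(&mut self.batch);
        if let Some(d) = self.next.take() {
            self.batch.push(d);
        }
        sent
    }
}
```

The alternative raised above, returning the mismatching `Datagram` from `push`, would make every caller responsible for holding it and re-pushing it after the send.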
I have started to do a version of this in the glue code. It's a bit challenging, because the current mainline of neqo has picked up a bunch of dependencies beyond those of Firefox, and I need to figure out how to upgrade those... I am wondering if we should cut a neqo release soon, before there is more divergence.
Signed-off-by: Lars Eggert <[email protected]>
I was planning to cut a new release once #2492 is merged. @larseggert I am happy to cut a new release beforehand if you like.