feat: Basic GSO support #2532

larseggert · 2025-03-27T13:29:16Z

This simply collects batches of same-size, same-marked datagrams to the same destination together by copying. In essence, we trade more memory copies for fewer system calls. Let's see if this matters at all.
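A minimal sketch of the grouping idea described above, not the PR's actual code: consecutive datagrams that share destination, marking, and size are copied into one contiguous buffer, which could then be handed to a single segmented send. `Dgram`, `Batch`, and `batch` are hypothetical names invented for illustration; the sketch does not issue any system calls.

```rust
use std::net::SocketAddr;

/// Hypothetical stand-in for a datagram; not neqo's actual `Datagram` type.
#[derive(Debug, Clone)]
struct Dgram {
    dst: SocketAddr,
    tos: u8, // IP marking (ECN/DSCP bits)
    payload: Vec<u8>,
}

/// Hypothetical batch: equal-size segments for one destination, stored back to back.
#[derive(Debug)]
struct Batch {
    dst: SocketAddr,
    tos: u8,
    segment_size: usize,
    data: Vec<u8>,
}

/// Group consecutive datagrams with the same destination, marking, and size.
/// Each payload is copied once into the batch buffer (the extra memory copy),
/// so that each batch needs only one send call (the saved system calls).
fn batch(dgrams: &[Dgram]) -> Vec<Batch> {
    let mut out: Vec<Batch> = Vec::new();
    for d in dgrams {
        let fits = |b: &&mut Batch| {
            b.dst == d.dst && b.tos == d.tos && b.segment_size == d.payload.len()
        };
        if let Some(b) = out.last_mut().filter(fits) {
            // Same metadata as the open batch: append by copying.
            b.data.extend_from_slice(&d.payload);
            continue;
        }
        // Metadata differs (or there is no batch yet): open a new batch.
        out.push(Batch {
            dst: d.dst,
            tos: d.tos,
            segment_size: d.payload.len(),
            data: d.payload.clone(),
        });
    }
    out
}
```

With this shape, the number of send calls equals the number of batches rather than the number of datagrams, at the cost of copying every payload once.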

larseggert · 2025-03-27T14:15:58Z

All QNS tests are failing. I see this in the logs:

server  | 1.021 INFO `libc::sendmsg` failed with Input/output error (os error 5); halting segmentation offload
server  | Error: IoError(Os { code: 5, kind: Uncategorized, message: "Input/output error" })

github-actions · 2025-03-27T14:32:54Z

Benchmark results

Performance differences relative to ddb88ac.

1-conn/1-100mb-resp/mtu-1504 (aka. Download)/client: 💚 Performance has improved.

       time:   [462.05 ms 468.21 ms 474.36 ms]
       thrpt:  [210.81 MiB/s 213.58 MiB/s 216.43 MiB/s]
change:
       time:   [-36.832% -35.880% -34.981%] (p = 0.00 < 0.05)
       thrpt:  [+53.801% +55.958% +58.307%]
Found 2 outliers among 100 measurements (2.00%)

2 (2.00%) high mild

1-conn/10_000-parallel-1b-resp/mtu-1504 (aka. RPS)/client: 💔 Performance has regressed.

       time:   [390.77 ms 393.70 ms 396.71 ms]
       thrpt:  [25.207 Kelem/s 25.400 Kelem/s 25.591 Kelem/s]
change:
       time:   [+10.764% +11.736% +12.692%] (p = 0.00 < 0.05)
       thrpt:  [-11.263% -10.504% -9.7176%]
Found 2 outliers among 100 measurements (2.00%)

2 (2.00%) high mild

1-conn/1-1b-resp/mtu-1504 (aka. HPS)/client: 💔 Performance has regressed.

       time:   [25.877 ms 26.049 ms 26.227 ms]
       thrpt:  [38.128  elem/s 38.390  elem/s 38.645  elem/s]
change:
       time:   [+2.0098% +2.9960% +4.0276%] (p = 0.00 < 0.05)
       thrpt:  [-3.8716% -2.9089% -1.9702%]
Found 1 outliers among 100 measurements (1.00%)

1 (1.00%) high mild

1-conn/1-100mb-req/mtu-1504 (aka. Upload)/client: 💚 Performance has improved.

       time:   [1.5963 s 1.6124 s 1.6281 s]
       thrpt:  [61.423 MiB/s 62.021 MiB/s 62.645 MiB/s]
change:
       time:   [-13.392% -11.990% -10.582%] (p = 0.00 < 0.05)
       thrpt:  [+11.834% +13.623% +15.463%]

decode 4096 bytes, mask ff: No change in performance detected.

       time:   [12.069 µs 12.096 µs 12.132 µs]
       change: [-0.4242% -0.0738% +0.2922%] (p = 0.69 > 0.05)
Found 10 outliers among 100 measurements (10.00%)

2 (2.00%) low severe

2 (2.00%) low mild

2 (2.00%) high mild

4 (4.00%) high severe

decode 1048576 bytes, mask ff: No change in performance detected.

       time:   [3.0765 ms 3.0854 ms 3.0951 ms]
       change: [-0.5653% -0.1081% +0.3419%] (p = 0.66 > 0.05)
Found 9 outliers among 100 measurements (9.00%)

1 (1.00%) low mild

1 (1.00%) high mild

7 (7.00%) high severe

decode 4096 bytes, mask 7f: No change in performance detected.

       time:   [20.148 µs 20.195 µs 20.250 µs]
       change: [-0.4185% +0.0984% +0.5604%] (p = 0.71 > 0.05)
Found 18 outliers among 100 measurements (18.00%)

1 (1.00%) low severe

3 (3.00%) low mild

14 (14.00%) high severe

decode 1048576 bytes, mask 7f: No change in performance detected.

       time:   [5.2578 ms 5.2710 ms 5.2858 ms]
       change: [-0.3092% +0.0578% +0.4232%] (p = 0.76 > 0.05)
Found 17 outliers among 100 measurements (17.00%)

17 (17.00%) high severe

decode 4096 bytes, mask 3f: No change in performance detected.

       time:   [7.0230 µs 7.0560 µs 7.0945 µs]
       change: [-0.8166% -0.1451% +0.4459%] (p = 0.66 > 0.05)
Found 17 outliers among 100 measurements (17.00%)

1 (1.00%) low severe

2 (2.00%) low mild

3 (3.00%) high mild

11 (11.00%) high severe

decode 1048576 bytes, mask 3f: No change in performance detected.

       time:   [1.7922 ms 1.7979 ms 1.8049 ms]
       change: [-0.6514% -0.1113% +0.4253%] (p = 0.68 > 0.05)
Found 6 outliers among 100 measurements (6.00%)

6 (6.00%) high severe

1 streams of 1 bytes/multistream: No change in performance detected.

       time:   [72.739 µs 72.948 µs 73.161 µs]
       change: [-0.4875% -0.0172% +0.4598%] (p = 0.94 > 0.05)
Found 2 outliers among 100 measurements (2.00%)

2 (2.00%) high mild

1000 streams of 1 bytes/multistream: No change in performance detected.

       time:   [24.805 ms 24.843 ms 24.883 ms]
       change: [-0.1240% +0.0865% +0.3002%] (p = 0.43 > 0.05)

10000 streams of 1 bytes/multistream: No change in performance detected.

       time:   [1.6888 s 1.6907 s 1.6925 s]
       change: [-0.0431% +0.1011% +0.2525%] (p = 0.17 > 0.05)
Found 10 outliers among 100 measurements (10.00%)

2 (2.00%) low mild

8 (8.00%) high mild

1 streams of 1000 bytes/multistream: No change in performance detected.

       time:   [74.407 µs 74.664 µs 74.922 µs]
       change: [-1.8086% -0.3011% +0.6919%] (p = 0.73 > 0.05)
Found 1 outliers among 100 measurements (1.00%)

1 (1.00%) high mild

100 streams of 1000 bytes/multistream: No change in performance detected.

       time:   [3.3915 ms 3.3973 ms 3.4036 ms]
       change: [-0.2584% +0.0155% +0.2893%] (p = 0.91 > 0.05)
Found 16 outliers among 100 measurements (16.00%)

16 (16.00%) high severe

1000 streams of 1000 bytes/multistream: No change in performance detected.

       time:   [144.57 ms 144.65 ms 144.73 ms]
       change: [-0.0401% +0.0436% +0.1259%] (p = 0.30 > 0.05)
Found 1 outliers among 100 measurements (1.00%)

1 (1.00%) high mild

coalesce_acked_from_zero 1+1 entries: No change in performance detected.

       time:   [94.494 ns 94.751 ns 95.009 ns]
       change: [-0.4734% +0.0743% +0.7168%] (p = 0.83 > 0.05)
Found 6 outliers among 100 measurements (6.00%)

2 (2.00%) high mild

4 (4.00%) high severe

coalesce_acked_from_zero 3+1 entries: No change in performance detected.

       time:   [112.53 ns 112.89 ns 113.30 ns]
       change: [-0.3333% +0.1069% +0.5137%] (p = 0.63 > 0.05)
Found 9 outliers among 100 measurements (9.00%)

1 (1.00%) high mild

8 (8.00%) high severe

coalesce_acked_from_zero 10+1 entries: No change in performance detected.

       time:   [111.97 ns 112.35 ns 112.82 ns]
       change: [-0.3878% +0.0474% +0.5106%] (p = 0.85 > 0.05)
Found 22 outliers among 100 measurements (22.00%)

5 (5.00%) low severe

3 (3.00%) low mild

4 (4.00%) high mild

10 (10.00%) high severe

coalesce_acked_from_zero 1000+1 entries: No change in performance detected.

       time:   [92.816 ns 93.232 ns 93.667 ns]
       change: [-1.2989% -0.2400% +0.8431%] (p = 0.66 > 0.05)
Found 6 outliers among 100 measurements (6.00%)

3 (3.00%) high mild

3 (3.00%) high severe

RxStreamOrderer::inbound_frame(): Change within noise threshold.

       time:   [115.95 ms 116.02 ms 116.08 ms]
       change: [+0.5935% +0.6622% +0.7327%] (p = 0.00 < 0.05)
Found 21 outliers among 100 measurements (21.00%)

8 (8.00%) low severe

1 (1.00%) high mild

12 (12.00%) high severe

SentPackets::take_ranges: No change in performance detected.

       time:   [8.4318 µs 8.6367 µs 8.8225 µs]
       change: [-2.7836% -0.2740% +2.1613%] (p = 0.82 > 0.05)
Found 17 outliers among 100 measurements (17.00%)

4 (4.00%) low severe

11 (11.00%) low mild

2 (2.00%) high mild

transfer/pacing-false/varying-seeds: No change in performance detected.

       time:   [35.878 ms 35.944 ms 36.010 ms]
       change: [-0.2768% -0.0286% +0.2290%] (p = 0.82 > 0.05)

transfer/pacing-true/varying-seeds: Change within noise threshold.

       time:   [35.884 ms 35.942 ms 35.999 ms]
       change: [-1.0147% -0.7745% -0.5382%] (p = 0.00 < 0.05)

transfer/pacing-false/same-seed: Change within noise threshold.

       time:   [36.009 ms 36.075 ms 36.141 ms]
       change: [+0.9178% +1.1673% +1.4154%] (p = 0.00 < 0.05)

transfer/pacing-true/same-seed: Change within noise threshold.

       time:   [36.116 ms 36.166 ms 36.216 ms]
       change: [+1.6688% +1.8491% +2.0443%] (p = 0.00 < 0.05)

Client/server transfer results

Performance differences relative to ddb88ac.

Transfer of 33554432 bytes over loopback, 30 runs. All unit-less numbers are in milliseconds.

Client	Server	CC	Pacing	Mean ± σ	Min	Max	MiB/s ± σ	Δ `main` (ms)	Δ `main` (%)
neqo	neqo	reno	on	248.5 ± 36.3	214.8	392.7	128.8 ± 0.9	💚 -166.1	-40.1%
neqo	neqo	reno		308.7 ± 171.1	208.0	854.5	103.6 ± 0.2	💚 -173.4	-36.0%
neqo	neqo	cubic	on	250.2 ± 42.9	204.8	394.7	127.9 ± 0.7	💚 -161.6	-39.3%
neqo	neqo	cubic		242.4 ± 32.8	209.5	386.1	132.0 ± 1.0	💚 -169.3	-41.1%
google	neqo	reno	on	759.3 ± 124.5	513.7	995.0	42.1 ± 0.3	-10.9	-1.4%
google	neqo	reno		761.0 ± 131.6	497.0	1013.5	42.0 ± 0.2	-9.3	-1.2%
google	neqo	cubic	on	842.6 ± 165.0	561.3	1008.7	38.0 ± 0.2	💔 81.9	10.8%
google	neqo	cubic		853.8 ± 177.1	558.2	1207.1	37.5 ± 0.2	💔 93.3	12.3%
google	google			577.3 ± 23.1	550.4	646.7	55.4 ± 1.4	7.4	1.3%
neqo	msquic	reno	on	280.0 ± 44.8	247.1	432.9	114.3 ± 0.7	12.6	4.7%
neqo	msquic	reno		280.7 ± 45.0	245.7	424.6	114.0 ± 0.7	10.9	4.1%
neqo	msquic	cubic	on	270.1 ± 31.4	222.1	408.0	118.5 ± 1.0	5.1	1.9%
neqo	msquic	cubic		265.9 ± 18.3	245.7	319.3	120.3 ± 1.7	-1.6	-0.6%
msquic	msquic			178.3 ± 27.6	147.4	276.2	179.4 ± 1.2	0.5	0.3%

github-actions · 2025-03-28T07:24:42Z

Failed Interop Tests

QUIC Interop Runner, client vs. server, differences relative to 9354a53.

neqo-latest as client

neqo-latest vs. aioquic: Z
neqo-latest vs. go-x-net: BP BA
neqo-latest vs. haproxy: BP BA
neqo-latest vs. kwik: run cancelled after 20 min
neqo-latest vs. linuxquic: C1
neqo-latest vs. lsquic: L1 C1
neqo-latest vs. msquic: Z A L1 C1
neqo-latest vs. mvfst: A L1 C1 🚀BA
neqo-latest vs. nginx: BP BA
neqo-latest vs. ngtcp2: CM
neqo-latest vs. picoquic: run cancelled after 20 min
neqo-latest vs. quic-go: A
neqo-latest vs. quiche: BP BA
neqo-latest vs. quinn: ⚠️L1
neqo-latest vs. s2n-quic: BP BA CM
neqo-latest vs. tquic: S BP BA
neqo-latest vs. xquic: run cancelled after 20 min

neqo-latest as server

Succeeded Interop Tests

QUIC Interop Runner, client vs. server

neqo-latest as client

neqo-latest vs. aioquic: H DC LR C20 M S R 3 B U A L1 L2 C1 C2 6 V2 BP BA
neqo-latest vs. go-x-net: H DC LR M B U A L2 C2 6
neqo-latest vs. haproxy: H DC LR C20 M S R Z 3 B U A L1 L2 C1 C2 6 V2
neqo-latest vs. linuxquic: H DC LR C20 M S R Z 3 B U E A L1 L2 C2 6 V2 BP BA CM
neqo-latest vs. lsquic: H DC LR C20 M S R Z 3 B U E A L2 C2 6 V2 BP BA
neqo-latest vs. msquic: H DC LR C20 M S R B U L2 C2 6 V2 BP BA
neqo-latest vs. mvfst: H DC LR M R Z 3 B U L2 C2 6 BP 🚀BA
neqo-latest vs. neqo: H DC LR C20 M S R Z 3 B U E A L1 L2 C1 C2 6 V2 BP BA CM
neqo-latest vs. neqo-latest: H DC LR C20 M S R Z 3 B U E A L1 L2 C1 C2 6 V2 BP BA CM
neqo-latest vs. nginx: H DC LR C20 M S R Z 3 B U A L1 L2 C1 C2 6
neqo-latest vs. ngtcp2: H DC LR C20 M S R Z 3 B U E A L1 L2 C1 C2 6 V2 BP BA
neqo-latest vs. quic-go: H DC LR C20 M S R Z 3 B U L1 L2 C1 C2 6 BP BA
neqo-latest vs. quiche: H DC LR C20 M S R Z 3 B U A L1 L2 C1 C2 6
neqo-latest vs. quinn: H DC LR C20 M S R Z 3 B U E A ⚠️L1 L2 C1 C2 6 BP BA
neqo-latest vs. s2n-quic: H DC LR C20 M S R 3 B U E A L1 L2 C1 C2 6
neqo-latest vs. tquic: H DC LR C20 M R Z 3 B U A L1 L2 C1 C2 6

neqo-latest as server

aioquic vs. neqo-latest: H DC LR C20 M S R Z 3 B A L1 L2 ⚠️C1 C2 6 V2 BP BA
chrome vs. neqo-latest: 3
go-x-net vs. neqo-latest: H DC LR M B U A L2 C2 6 BP BA
kwik vs. neqo-latest: H DC LR C20 M S R Z 3 B U A L1 L2 C1 C2 6 V2
linuxquic vs. neqo-latest: H DC LR C20 M S R Z 3 B U E A L1 L2 C1 C2 6 V2 BP BA CM
lsquic vs. neqo-latest: H DC LR M S R 3 B E A L1 L2 C1 C2 6 V2 BP BA
msquic vs. neqo-latest: H DC LR C20 M S R B A L1 L2 C1 C2 6 V2 BP BA
mvfst vs. neqo-latest: H DC LR M 3 B L2 C2 6 BP BA
neqo vs. neqo-latest: H DC LR C20 M S R Z 3 B U E A L1 L2 C1 C2 6 V2 BP BA CM
ngtcp2 vs. neqo-latest: H DC LR C20 M S R Z 3 B U E A L1 L2 C1 C2 6 V2 BP BA CM
openssl vs. neqo-latest: H DC C20 S R 3 B A L2 C2 6 BP BA
picoquic vs. neqo-latest: H DC LR C20 M S R Z 3 B U E A L1 L2 C1 C2 6 V2 BP BA CM
quic-go vs. neqo-latest: H DC LR C20 M S R Z 3 B U A L1 L2 C1 C2 6 BP BA
quiche vs. neqo-latest: H DC LR M S R Z 3 B A L1 L2 C1 C2 6 BP BA
quinn vs. neqo-latest: H DC LR C20 M S R Z 3 B U E A L1 L2 C1 C2 6 BP BA
s2n-quic vs. neqo-latest: H DC LR M S R 3 B E A L1 L2 C1 C2 6 BP BA
tquic vs. neqo-latest: H DC LR M S R Z 3 B A L1 L2 C1 C2 6 BP BA
xquic vs. neqo-latest: H DC LR C20 S R Z 3 B U A L1 L2 C1 C2 6 BP BA

Unsupported Interop Tests

QUIC Interop Runner, client vs. server

neqo-latest as client

neqo-latest vs. aioquic: E CM
neqo-latest vs. go-x-net: C20 S R Z 3 E L1 C1 V2 CM
neqo-latest vs. haproxy: E CM
neqo-latest vs. lsquic: CM
neqo-latest vs. msquic: 3 E CM
neqo-latest vs. mvfst: C20 S E V2 CM
neqo-latest vs. nginx: E V2 CM
neqo-latest vs. quic-go: E V2 CM
neqo-latest vs. quiche: E V2 CM
neqo-latest vs. quinn: V2 CM
neqo-latest vs. s2n-quic: Z V2
neqo-latest vs. tquic: E V2 CM

neqo-latest as server

aioquic vs. neqo-latest: U E
chrome vs. neqo-latest: H DC LR C20 M S R Z B U E A L1 L2 C1 C2 6 V2 BP BA CM
go-x-net vs. neqo-latest: C20 S R Z 3 E L1 C1 V2
kwik vs. neqo-latest: E
lsquic vs. neqo-latest: C20 Z U
msquic vs. neqo-latest: 3 E
mvfst vs. neqo-latest: C20 S R U E V2
openssl vs. neqo-latest: Z U E L1 C1 V2
quic-go vs. neqo-latest: E V2
quiche vs. neqo-latest: C20 U E V2
s2n-quic vs. neqo-latest: C20 Z U V2
tquic vs. neqo-latest: C20 U E V2
xquic vs. neqo-latest: E V2

mxinden

Early benchmarks look promising. That said, I am not sure whether we will see similar improvements when benchmarking through Firefox with real connection latency and a bandwidth limit.

As discussed out-of-band, I would favor a more integrated implementation that moves all batching logic into neqo-transport::Connection. Connection can batch more efficiently because it has access to all known information about the connection and can allocate all batchable datagrams at once. In addition, this would allow a single batching implementation, used by neqo-client, neqo-server, mozilla-central/http3server and, of course, Firefox.

For others: a past draft of the above-mentioned integrated implementation is f25b0b7.

@larseggert what are the next steps? I would suggest applying the same non-integrated optimization to neqo_glue/src/lib.rs. You can easily use a custom neqo-* version through a mozilla-central/Cargo.toml override. We can then test Firefox upload speed either against a local HTTP3 server, or using Andrew's upload automation (macOS) for more reproducible results, which uses a real-world connection to Google's infrastructure instead of a localhost setup.

mxinden · 2025-03-31T13:17:45Z

neqo-udp/src/lib.rs

+/// When a datagram is pushed that does not match the meta data of the batch,
+/// it is stored in `next` and a send indication is returned.

The mechanism of `next` is not intuitive to me. Why doesn't `push` simply return the `Datagram` when it doesn't match?

Because then the different users of the type would need to implement their own method for storing it and switching to it. I thought it would be simpler if the type handled that for the caller.
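To make the trade-off in this thread concrete, here is a rough sketch of the "store the mismatch in `next`" design. It is an illustration under assumptions, not the code under review: `Meta`, `Dgram`, `Push`, `Batch`, and `take` are hypothetical names, and the destination is reduced to a port number to keep the example short.

```rust
/// Hypothetical batch metadata; the destination is simplified to a port.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Meta {
    dst: u16,
    tos: u8,
    len: usize,
}

/// Hypothetical datagram type for this sketch.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Dgram {
    meta: Meta,
    payload: Vec<u8>,
}

/// What `push` tells the caller to do.
#[derive(Debug, PartialEq, Eq)]
enum Push {
    Buffered,
    SendNow,
}

#[derive(Default)]
struct Batch {
    meta: Option<Meta>,
    data: Vec<u8>,
    /// A datagram that did not match the current batch, parked until `take`.
    next: Option<Dgram>,
}

impl Batch {
    /// Add a datagram. On a metadata mismatch, park it in `next` and ask the
    /// caller to send the current batch. The caller is expected to call
    /// `take` before pushing again, otherwise `next` would be overwritten.
    fn push(&mut self, d: Dgram) -> Push {
        let matches = self.meta.as_ref().map(|m| *m == d.meta);
        match matches {
            None => {
                // Empty batch: the datagram defines the batch metadata.
                self.meta = Some(d.meta);
                self.data = d.payload;
                Push::Buffered
            }
            Some(true) => {
                self.data.extend_from_slice(&d.payload);
                Push::Buffered
            }
            Some(false) => {
                self.next = Some(d);
                Push::SendNow
            }
        }
    }

    /// Hand out the current batch for sending and start the following batch
    /// from the parked datagram, if there is one.
    fn take(&mut self) -> Option<(Meta, Vec<u8>)> {
        let meta = self.meta.take()?;
        let data = std::mem::take(&mut self.data);
        if let Some(n) = self.next.take() {
            self.meta = Some(n.meta);
            self.data = n.payload;
        }
        Some((meta, data))
    }
}
```

In this shape, callers only react to `Push::SendNow` by calling `take` and sending the result; they never hold a rejected datagram themselves. The alternative raised in the question, having `push` return the non-matching datagram, would move that bookkeeping into every caller.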

larseggert · 2025-03-31T14:19:16Z

I have started to do a version of this in the glue code. It's a bit challenging because the current mainline of neqo has picked up a bunch of dependencies beyond that of Firefox, and I need to figure out how to upgrade those...

Am wondering if we should cut a neqo release soon before there is more divergence.

mxinden · 2025-03-31T19:25:09Z

> Am wondering if we should cut a neqo release soon before there is more divergence.

I was planning to cut a new release once #2492 is merged. @larseggert I am happy to cut a new release beforehand if you like.

larseggert added 3 commits March 27, 2025 15:27

feat: Basic GSO support

7a5e5a4

This simply collects batches of same-size, same-marked datagrams to the same destination together by copying. In essence, we trade more memory copies for fewer system calls. Let's see if this matters at all.

Fixes

04bd127

Merge branch 'main' into feat-sendmmsg

67f667b

larseggert added 8 commits March 27, 2025 16:48

Fix QNS?

ce81d81

DatagramMetaData

eb9851c

Optimize

1e9beac

Fixes

d48f25f

More

aaecebd

Again

b44d9bd

Again

00a0557

Fix QNS?

65cdb23

larseggert force-pushed the feat-sendmmsg branch from 09e04c6 to 65cdb23 Compare March 28, 2025 13:36

Merge branch 'main' into feat-sendmmsg

a16e66f

larseggert marked this pull request as ready for review March 28, 2025 15:23

larseggert requested review from KershawChang, martinthomson and mxinden as code owners March 28, 2025 15:23

larseggert added 4 commits March 31, 2025 08:38

Merge branch 'main' into feat-sendmmsg

0f85c3c

Refactor

9ff9a0c

smallvec

a049257

Tweaks

e7bc1a3

mxinden reviewed Mar 31, 2025

larseggert added 3 commits March 31, 2025 18:57

Merge branch 'main' into feat-sendmmsg

4eef432

Signed-off-by: Lars Eggert <[email protected]>

No SmallVec

7f4221a

Minimize

879de59

larseggert added 10 commits April 1, 2025 08:15

PartialEq

55550eb

DatagramMetaData

4b070c3

Finalize

f6cdd02

SmallVec again

e522c01

Merge branch 'main' into feat-sendmmsg

263934b

Signed-off-by: Lars Eggert <[email protected]>

Simplify

c9c8ec8

Minimize more

416e3d1

clippy

43cd3a1

Again

07a8376

clippy

3c4bbd0
