fix(iroh): always ping new paths, add test for reconnecting with changed addrs #3372


Open · wants to merge 5 commits into base: main

Conversation

@Frando (Member) commented Jun 27, 2025

Description

Adds a test that closes a client endpoint and immediately recreates it with the same node id, then tries to connect to a server it was previously connected to. The test fails on main.

The reason is that the server thinks the client's previous UDP send address is still valid. Even though it receives both disco pings and QUIC packets from the new address, it doesn't update the best_addr used for sending, so the server's QUIC replies are sent into nirvana and the test times out.

The second commit fixes this behavior by always sending disco pings to new paths.
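The "always ping new paths" part of the fix can be sketched roughly like this (hypothetical types and names, not the actual magicsock internals): remember which UDP paths have been seen for a peer, and send a disco ping on any path that is new, even when the IP address is already known under a different port.

```rust
use std::collections::HashSet;
use std::net::SocketAddr;

/// Hypothetical sketch (not the real iroh types): track the UDP paths
/// already seen for a peer, and ping every path we have not seen yet.
#[derive(Default)]
struct PeerPaths {
    known: HashSet<SocketAddr>,
}

impl PeerPaths {
    /// Called when a packet arrives from `addr`.
    /// Returns true when `addr` is a new path, i.e. a disco ping should be sent.
    fn should_ping(&mut self, addr: SocketAddr) -> bool {
        // HashSet::insert returns true only if the address was not present.
        self.known.insert(addr)
    }
}

fn main() {
    let mut peer = PeerPaths::default();
    let old: SocketAddr = "10.0.0.1:4433".parse().unwrap();
    let rebound: SocketAddr = "10.0.0.1:5544".parse().unwrap();
    assert!(peer.should_ping(old)); // first packet from this path: ping it
    assert!(!peer.should_ping(old)); // known path: nothing to do
    // Same IP but a new port after the client recreated its endpoint:
    // still an unseen path, so it gets pinged too.
    assert!(peer.should_ping(rebound));
    println!("ok");
}
```

The point of keying on the full `SocketAddr` rather than just the IP is that a recreated endpoint on the same host shows up as a new port, which must count as a new path.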

Breaking Changes

Notes & open questions

Change checklist

  • Self-review.
  • Documentation updates following the style guide, if relevant.
  • Tests if relevant.
  • All breaking changes documented.
    • List all breaking changes in the above "Breaking Changes" section.
    • Open an issue or PR on any number0 repos that are affected by this breaking change. Give guidance on how the updates should be handled or do the actual updates themselves. The major ones are:

github-actions bot commented Jun 27, 2025

Documentation for this PR has been generated and is available at: https://n0-computer.github.io/iroh/pr/3372/docs/iroh/

Last updated: 2025-06-30T09:24:39Z

github-actions bot commented Jun 27, 2025

Netsim report & logs for this PR have been generated and are available at: LOGS
This report will remain available for 3 days.

Last updated for commit: 38380b8

@n0bot n0bot bot added this to iroh Jun 27, 2025
@github-project-automation github-project-automation bot moved this to 🏗 In progress in iroh Jun 27, 2025
@Frando force-pushed the Frando/test-abort-reconnect branch from 56c059c to 34ec2a2 on June 27, 2025 09:14
@Frando changed the title from "tests(iroh): add test to reconnect after forceful abort" to "tests(iroh): failing test to reconnect after endpoint recreation" on Jun 27, 2025
@Frando changed the title from "tests(iroh): failing test to reconnect after endpoint recreation" to "tests(iroh): failing test for reconnecting after endpoint recreation" on Jun 27, 2025
@Frando force-pushed the Frando/test-abort-reconnect branch 2 times, most recently from fc031ea to 01a23ec on June 27, 2025 09:37
@Frando changed the title from "tests(iroh): failing test for reconnecting after endpoint recreation" to "fix(iroh): always ping new paths, add test for reconnecting with changed addrs" on Jun 27, 2025
@Frando force-pushed the Frando/test-abort-reconnect branch from 01a23ec to 087f003 on June 27, 2025 09:44
@flub (Contributor) left a comment

Yes, this is a good fix! I can't think of any reason why this extra ping back would ever do much harm (famous last words).

My only concern is that the 5s is asking for flakiness, but I can't really think of anything better right away.

```rust
.e()??;
assert_eq!(rx.recv().await.unwrap().unwrap(), 23);
// close the endpoint in a separate task, to not lose time for our immediate respawn testing
let close1 = tokio::task::spawn(async move { ep.close().await });
```
Contributor commented:

How long does this take? Since both endpoints are still there this should be pretty fast, no? Feels a bit odd to spawn this.

Member Author (@Frando) replied:

Yeah, I'll remove it again. It was a trial for the Windows failure, but that's not it; see the comment below.

@flub (Contributor) commented Jun 27, 2025

> My only concern is that the 5s is asking for flakiness, but I can't really think of anything better right away.

Oh, lol. Windows CI already fails because of this... Guess we'll have to figure this out right away.

@flub (Contributor) commented Jun 27, 2025

I think you might be able to set this to a 30s timeout because without a relay that would still fail after 30s in the old version? You can verify that.

@flub (Contributor) commented Jun 27, 2025

Alternatively you can mimic what #3350 does and measure the time it takes to connect initially. Then require the 2nd connection to succeed in 3 times that base time. Could be more robust?

@Frando (Member, Author) commented Jun 27, 2025

The flakiness is not only due to slow CI, I think; I got it to fail locally too from time to time. My current reasoning is that the fix applied here is not enough (pasting from Discord):

So with the change from the PR, a new path is pinged. Once we get the pong, we check if this path should be the new best_addr, which takes us to here:
https://github.com/n0-computer/iroh/blob/main/iroh/src/magicsock/node_map/best_addr.rs#L111
The new path will only become the best_addr (and thus the send addr for the next QUIC packet) if it has a lower latency than the current best_addr, or if the current best_addr exceeds its trust interval.
In our test case, the previous addr is still trusted. The path is also physically identical, so it is purely random which of the two pingpongs had a lower latency.
If the pingpong from the new client endpoint has a lower latency, the addr from the new client endpoint is used from then on. If the pingpong from the old client endpoint happened to have a lower latency, the best_addr is kept and future UDP packets are sent into nirvana.
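The replacement rule described above can be condensed into a small sketch (names hypothetical, not the actual `BestAddr` code) that shows why a still-trusted stale address survives a slower pong on the new path:

```rust
use std::time::{Duration, Instant};

// Hedged sketch of the selection rule: a pong on another path only wins if
// the current best addr is no longer trusted, or if the new path measured a
// strictly lower latency. Names are illustrative, not the iroh internals.
struct Best {
    latency: Duration,
    trust_until: Instant,
}

fn pong_replaces_best(best: &Best, new_latency: Duration, now: Instant) -> bool {
    now >= best.trust_until || new_latency < best.latency
}

fn main() {
    let now = Instant::now();
    let best = Best {
        latency: Duration::from_millis(5),
        trust_until: now + Duration::from_secs(6), // still inside the trust interval
    };
    // The old and new paths are physically identical, so latencies are
    // essentially random. A slightly slower pong does not dislodge the
    // stale best addr, and UDP packets keep going to the dead address.
    assert!(!pong_replaces_best(&best, Duration::from_millis(6), now));
    // Only a faster pong wins while the stale addr is trusted.
    assert!(pong_replaces_best(&best, Duration::from_millis(4), now));
    // Once the trust interval expires, any pong wins.
    assert!(pong_replaces_best(&best, Duration::from_millis(6), now + Duration::from_secs(7)));
    println!("ok");
}
```

This makes the coin flip explicit: with two pongs racing over the same physical path, whether the stale address survives comes down to which round trip happened to be faster.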

@Frando Frando force-pushed the Frando/test-abort-reconnect branch from 0b10a19 to a00644a Compare June 27, 2025 11:47
@Frando (Member, Author) commented Jun 30, 2025

I pushed a commit that makes the test pass reliably (locally for me at least):

a8ee257

It adds the following case to BestAddr::insert_if_better_or_reconfirm, which is called when a pong is received on a UDP path:

```rust
// If we receive a pong on a different port but the same IP address as the current best addr,
// we assume that the endpoint has rebound, and thus use the new port.
} else if state.addr.addr.ip() == addr.ip() {
    self.insert(addr, latency, source, confirmed_at)
}
```

So when we receive a pong from the same IP address as our current best_addr but with a different port, we unconditionally switch our best addr to that new port, even if its latency is higher than the best addr's latency from the last pingpong.

So far this was the simplest I could come up with. Can you think of any scenarios where that would lead to undesired outcomes?
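Folded into the selection rule, the new same-IP case can be sketched like this (hypothetical names, not the actual `BestAddr` implementation): besides the usual "faster or no longer trusted" conditions, a pong from the same IP on a different port is treated as an endpoint rebind and wins unconditionally.

```rust
use std::net::SocketAddr;
use std::time::{Duration, Instant};

// Illustrative sketch only; the real iroh code lives in
// BestAddr::insert_if_better_or_reconfirm.
struct Best {
    addr: SocketAddr,
    latency: Duration,
    trust_until: Instant,
}

fn pong_replaces_best(best: &Best, addr: SocketAddr, latency: Duration, now: Instant) -> bool {
    if addr == best.addr {
        return false; // same path: just reconfirm, nothing to replace
    }
    now >= best.trust_until           // trust interval expired
        || latency < best.latency     // strictly faster path
        || addr.ip() == best.addr.ip() // same IP, new port: assume rebind
}

fn main() {
    let now = Instant::now();
    let best = Best {
        addr: "10.0.0.1:4433".parse().unwrap(),
        latency: Duration::from_millis(5),
        trust_until: now + Duration::from_secs(6),
    };
    // A slower pong from the same IP on a new port still wins (rebind case).
    assert!(pong_replaces_best(&best, "10.0.0.1:5544".parse().unwrap(), Duration::from_millis(9), now));
    // A slower pong from a different IP does not, while the best addr is trusted.
    assert!(!pong_replaces_best(&best, "10.0.0.2:4433".parse().unwrap(), Duration::from_millis(9), now));
    println!("ok");
}
```

One design consequence worth noting: this only covers rebinds that keep the same IP, which matches the reconnect-on-localhost test but not a client that comes back from a different network.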

@Frando Frando requested a review from flub June 30, 2025 09:31
```rust
} else if state.addr.addr == addr {
    state.confirmed_at = confirmed_at;
    state.trust_until = Some(source.trust_until(confirmed_at));
// If we receive a pong on a different port but the same IP address as the current best addr,
// we assume that the endpoint has rebound, and thus use the new port.
```
Contributor commented:

My main issue with this logic is that I find it fairly arbitrary. It fixes an artificial test case, but it seems like a subset of the cases that need to be fixed. We're also not really sure we can send on this path at this point, but I'm not even sure that's something we're always sure of anyway.

Labels: None yet
Projects: iroh (Status: 🏗 In progress)
2 participants