fix(iroh): always ping new paths, add test for reconnecting with changed addrs #3372
Conversation
Documentation for this PR has been generated and is available at: https://n0-computer.github.io/iroh/pr/3372/docs/iroh/ Last updated: 2025-06-30T09:24:39Z
Yes, this is a good fix! I can't think of any reason why this extra ping back would ever do much harm (famous last words).
My only concern is that the 5s timeout is asking for flakiness, but I can't really think of something better right away.
iroh/src/endpoint.rs (outdated)
.e()??;
assert_eq!(rx.recv().await.unwrap().unwrap(), 23);
// close the endpoint in a separate task, to not lose time for our immediate respawn testing
let close1 = tokio::task::spawn(async move { ep.close().await });
How long does this take? Since both endpoints are still there, this should be pretty fast, no? Feels a bit odd to spawn this.
Yeah, I'll remove it again. It was a trial for the Windows failure, but that's not it; see the comment below.
Oh, lol. Windows CI already fails because of this... Guess we'll have to figure this out right away.
I think you might be able to set this to a 30s timeout, because without a relay this would still fail after 30s in the old version? You can verify that.
Alternatively, you could mimic what #3350 does and measure the time it takes to connect initially, then require the 2nd connection to succeed within 3 times that base time. Could be more robust?
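To make that idea concrete, here is a rough sketch of the "3x the measured baseline" approach. It is not the PR's actual test code: the helper name and ALPN are made up, and the `Endpoint::connect`/`NodeAddr` shapes are assumed from recent iroh versions.

```rust
use std::time::Instant;

use tokio::time::timeout;

// Hypothetical helper: measure the first connection as a baseline, then
// bound the reconnect by 3x that baseline instead of a fixed 5s.
async fn reconnect_within_3x_baseline(
    client: &iroh::Endpoint,
    server_addr: iroh::NodeAddr,
) -> anyhow::Result<()> {
    const ALPN: &[u8] = b"test/0";

    // Measure how long the initial (cold) connection takes.
    let start = Instant::now();
    let conn = client.connect(server_addr.clone(), ALPN).await?;
    let baseline = start.elapsed();
    drop(conn);

    // Require the second connection to succeed within 3x the baseline,
    // which scales with how slow the CI machine actually is.
    timeout(3 * baseline, client.connect(server_addr, ALPN))
        .await
        .expect("reconnect took more than 3x the initial connect")?;
    Ok(())
}
```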
The flakiness is not only due to slow CI, I think; I got it to fail locally too from time to time. My current reasoning is that the fix applied here is not enough (pasting from Discord): with the change from the PR, a new path is pinged. Once we get the pong, we check whether this path should be the new best_addr, which takes us to here:
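Roughly, the check in question is latency-gated. The following is a simplified, hypothetical sketch (not the actual iroh code) of why the extra ping alone may not flip best_addr: the pong for the new path only wins if it beats the current best path on latency.

```rust
use std::{net::SocketAddr, time::Duration};

// Simplified, hypothetical model of a latency-gated best-addr update.
// A client that rebound to a new port on the same IP can keep losing to
// the stale best_addr as long as its measured latency isn't better.
struct BestAddr {
    addr: SocketAddr,
    latency: Duration,
}

impl BestAddr {
    fn maybe_replace(&mut self, candidate: SocketAddr, latency: Duration) {
        if latency < self.latency {
            self.addr = candidate;
            self.latency = latency;
        }
        // Otherwise the old (possibly dead) address stays the best addr,
        // even though we just heard from the new one.
    }
}
```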
I pushed a commit that makes the test pass reliably (locally for me at least). It adds the following case:

// If we receive a pong on a different port but the same IP address as the current best addr,
// we assume that the endpoint has rebound, and thus use the new port.
} else if state.addr.addr.ip() == addr.ip() {
    self.insert(addr, latency, source, confirmed_at)
}

So when we receive a pong from the same IP address as our current best_addr, but with a different port, we unconditionally switch our best addr to that new port, regardless of how its latency compares to the best addr's latency from the last ping-pong. So far this is the simplest fix I could come up with. Can you think of any scenarios where that would lead to undesired outcomes?
} else if state.addr.addr == addr {
    state.confirmed_at = confirmed_at;
    state.trust_until = Some(source.trust_until(confirmed_at));
    // If we receive a pong on a different port but the same IP address as the current best addr,
    // we assume that the endpoint has rebound, and thus use the new port.
My main issue with this logic is that I find it fairly arbitrary. It fixes an artificial test case, but it seems like a subset of the cases that need to be fixed. We're also not really sure we can send on this path at this point, though I'm not even sure that's something we're ever certain of anyway.
Description
Adds a test that closes a client endpoint and recreates it right away with the same node id. It then tries to connect to a server it was previously connected to. The test fails on main.
The reason is that the server thinks the client's previous UDP send address is still valid. Even though it receives both disco pings and QUIC packets from the new address, it doesn't update the best_addr used for sending, so the server's QUIC replies are sent into nirvana and the test times out.
The second commit fixes this behavior by always sending disco pings to new paths.
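For reference, a rough outline of the reconnect scenario the test exercises. This is not the PR's test code: the ALPN is made up and the iroh builder/connect/close signatures are assumed from recent versions and may differ.

```rust
// Close a client endpoint and immediately recreate it with the same node id
// (same secret key), then reconnect to the same server.
async fn reconnect_with_same_node_id(
    secret_key: iroh::SecretKey,
    server_addr: iroh::NodeAddr,
) -> anyhow::Result<()> {
    const ALPN: &[u8] = b"test/0";

    // First endpoint: bind, connect to the server, then close.
    let ep = iroh::Endpoint::builder()
        .secret_key(secret_key.clone())
        .bind()
        .await?;
    let conn = ep.connect(server_addr.clone(), ALPN).await?;
    drop(conn);
    ep.close().await;

    // Recreate the endpoint right away with the same node id. It binds to a
    // new UDP port, so the server must stop sending to the old path.
    let ep = iroh::Endpoint::builder()
        .secret_key(secret_key)
        .bind()
        .await?;
    let _conn = ep.connect(server_addr, ALPN).await?;
    Ok(())
}
```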
Breaking Changes
Notes & open questions
Change checklist
quic-rpc
iroh-gossip
iroh-blobs
dumbpipe
sendme