A test for weird long connection behavior from same Node IDs #3350
Conversation
iroh/src/endpoint.rs
elapsed_times.iter().enumerate().for_each(|(i, elapsed)| {
    println!("Elapsed time for connection {i}: {elapsed:?}");
});

// assert!(elapsed1 < Duration::from_secs(5));
// assert!(elapsed2 < Duration::from_secs(5));
// check that elapsed2 completed within 1 second of elapsed1
// assert!(elapsed2 <= elapsed1 + Duration::from_secs(1));
I'm trying to figure out a way to properly "test" that the subsequent elapsed times are way too long to be normal and are indicative of an error.
Writing this in a way that won't be flaky is tricky for sure. Maybe you could take the minimum time as a baseline, and not allow any of the times to exceed 3 times that baseline? From the times you have posted that seems reasonable.
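For concreteness, a sketch of that check (hedged: elapsed_times is assumed to be the Vec<Duration> collected by the test above, and the 3x factor is just the suggested bound, not a measured one):

let baseline = *elapsed_times
    .iter()
    .min()
    .expect("at least one timed connection");
for (i, elapsed) in elapsed_times.iter().enumerate() {
    assert!(
        *elapsed <= baseline * 3,
        "connection {i} took {elapsed:?}, more than 3x the {baseline:?} baseline"
    );
}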
Thanks for this, looks good!
iroh/src/endpoint.rs
let server = Endpoint::builder()
    .alpns(vec![TEST_ALPN.to_vec()])
    .discovery_local_network()
I wonder if this test really needs a discovery service. Would it not work anyway because both endpoints are on the same network, so the NodeAddr should contain a usable IP addr.
Hmm, interestingly enough, it seems this issue is related to the relay_mode and not to local discovery.
When I remove local discovery the issue still persists, so you are right on that front.
But when the relay mode is left at its default the connection does not take a long time; setting it to disabled is what triggers the issue. Good callout, this narrows it down a bit more.
Also, I wonder if that means a "good" test here would need to cover this scenario with many endpoint configurations, since there could be a regression that only shows up with certain features enabled or disabled.
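As a hedged sketch of what parameterizing one configuration axis could look like (not the PR's actual code; bind_server is a hypothetical helper, TEST_ALPN is assumed from the existing tests, and builder method names follow iroh's public API):

use iroh::{Endpoint, RelayMode};

// Hypothetical helper: bind a server endpoint for one configuration under test.
// RelayMode::Disabled is the variant that reproduces the slowdown discussed here.
async fn bind_server(relay_mode: RelayMode) -> anyhow::Result<Endpoint> {
    let server = Endpoint::builder()
        .alpns(vec![TEST_ALPN.to_vec()])
        .relay_mode(relay_mode)
        .bind()
        .await?;
    Ok(server)
}

The test could then run the same connect loop once per RelayMode variant (and with/without local discovery) and compare the latencies per configuration.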
Some more findings:
Based on the sample runs below I can theorize these two things:
Also, to further confirm that this only applies when the client re-uses the same secret key, I ran the above with no
Another run, with relay disabled on the client side only, with 50 echoes
The problem is with the state that the server keeps: the client migrates to a new address, but the server keeps trying to reach it on the old one.

This is not a problem when the server is reachable on the relay server: the client will contact the server via the relay if it gets no response on the direct path (or it will race the two or something, I forget the details, but racing is more likely). It will then initiate holepunching, and by doing so the server learns about the new addresses of the client that can be used, and the direct connection will succeed (even though holepunching is not technically needed here, it is only used to learn about the direct addresses).

Enabling the relay on the client side has no effect because it is never used: the client connects to the server, so only the relay of the server is ever used.

For the test, I suggest you disable relay servers on both sides (no need to contact a relay server or spin up a test one) and do the latency check for establishing a connection. That's enough to show the bug.
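A rough skeleton of that suggestion, as a hedged sketch rather than a drop-in test: TEST_ALPN is assumed from the existing tests, SecretKey::generate and Endpoint::close signatures differ between iroh versions, and node_addr stands for whichever accessor resolves the server's NodeAddr in the version under test.

use std::time::Instant;
use iroh::{Endpoint, RelayMode, SecretKey};

#[tokio::test]
async fn connect_from_same_node_id_without_relays() -> anyhow::Result<()> {
    // Relays disabled on both sides: the stale-address behavior shows up on
    // the direct path alone, so no relay server needs to be contacted.
    let server = Endpoint::builder()
        .alpns(vec![TEST_ALPN.to_vec()])
        .relay_mode(RelayMode::Disabled)
        .bind()
        .await?;
    let server_addr = server.node_addr().await?;

    // Accept loop: drive incoming handshakes so client connects can complete.
    let accept_server = server.clone();
    tokio::spawn(async move {
        while let Some(incoming) = accept_server.accept().await {
            let _conn = incoming.await;
        }
    });

    // Re-use one secret key, and thus one NodeId, across all clients; this is
    // what leaves the stale address state behind on the server.
    let secret_key = SecretKey::generate(rand::rngs::OsRng);

    let mut elapsed_times = Vec::new();
    for _ in 0..3 {
        let client = Endpoint::builder()
            .secret_key(secret_key.clone())
            .relay_mode(RelayMode::Disabled)
            .bind()
            .await?;
        let start = Instant::now();
        let conn = client.connect(server_addr.clone(), TEST_ALPN).await?;
        elapsed_times.push(start.elapsed());
        conn.close(0u32.into(), b"done");
        client.close().await;
    }

    // Latency check: every attempt must stay within 3x the fastest one.
    let baseline = *elapsed_times.iter().min().expect("at least one connection");
    for (i, elapsed) in elapsed_times.iter().enumerate() {
        assert!(*elapsed <= baseline * 3, "connection {i}: {elapsed:?} vs baseline {baseline:?}");
    }
    Ok(())
}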
If we tidy things up we can probably merge this. Let's make the test name start with expected_failure, mark it with #[ignore = "issue #xyz"], and file an issue describing that the server keeps sending to the wrong address for too long, more or less what I say in the other comment. Then this should be more or less ready for merging, I think.
Can we remove things here again? This is an example, not a debugging tool.
The warning can be checked for in a test. It is about some internal cleanup; nothing on the application layer is misbehaving.
conn.close(42u32.into(), b"thanks, bye!");
// TODO this causes a warn that things are not cleaned up gracefully, how can we fail a
// test due to that?
As said on the example: this is about some internal cleanup, nothing that misbehaves for users. You can track this in a separate test that checks there are no warnings logged for a normal connection close.
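A hedged sketch of what such a test could look like: a minimal tracing-subscriber Layer that counts WARN-or-worse events, with the actual connect/close elided. Note that set_default is thread-local, which works for a current-thread test runtime.

use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use tracing::Level;
use tracing_subscriber::{layer::Context, prelude::*, Layer};

// Counts WARN and ERROR events emitted while the subscriber guard is alive.
#[derive(Clone, Default)]
struct WarnCounter(Arc<AtomicUsize>);

impl<S: tracing::Subscriber> Layer<S> for WarnCounter {
    fn on_event(&self, event: &tracing::Event<'_>, _ctx: Context<'_, S>) {
        // In tracing, ERROR < WARN < INFO, so `<=` matches WARN and ERROR.
        if *event.metadata().level() <= Level::WARN {
            self.0.fetch_add(1, Ordering::Relaxed);
        }
    }
}

#[tokio::test]
async fn no_warnings_on_graceful_close() {
    let counter = WarnCounter::default();
    let _guard = tracing_subscriber::registry()
        .with(counter.clone())
        .set_default();

    // ... open a connection, close it gracefully, wait for shutdown ...

    assert_eq!(
        counter.0.load(Ordering::Relaxed),
        0,
        "graceful close logged warnings"
    );
}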
Hey, this is being worked on in #3372, which can probably replace this, as it also contains a (so far flaky...) test.
Description
Regarding this Discord halp thread.
Trying to figure out how to prove there are weird connection bugs in iroh.
Specifically:
cargo test -- --nocapture --test graceful_close
to see these logs; this is seen on the main branch of iroh as of writing this.
Sample output
cargo test --features="discovery-local-network" -- --nocapture --test can_connect_from_same_id
Sample output
Breaking Changes
Notes & open questions
Change checklist
quic-rpc
iroh-gossip
iroh-blobs
dumbpipe
sendme