
A test for weird long connection behavior from same Node IDs #3350


Draft · wants to merge 2 commits into main

Conversation


@RingOfStorms commented Jun 16, 2025

Description

Regarding this Discord halp thread.

Trying to figure out how to prove there are weird connection bugs in iroh.

Specifically:

  • Graceful close is not graceful; note that when logging is on you will see a WARN about failing to clean up tasks on any of the graceful examples.
    • Run cargo test -- --nocapture --test graceful_close to see these logs. This is seen on the main branch of iroh as of writing.

Sample output

2025-06-16T21:05:34.991992Z DEBUG graceful_close:ep{me=ac83794a87}:magicsock:actor: iroh::magicsock: shutdown complete
2025-06-16T21:05:35.093639Z  WARN graceful_close:close{me=ac83794a87}: iroh::magicsock: tasks didn't finish in time, aborting remaining 1/3 tasks
2025-06-16T21:05:35.093786Z TRACE graceful_close:close{me=ac83794a87}: iroh::magicsock: magicsock closed
test endpoint::tests::graceful_close ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 78 filtered out; finished in 4.75s
  • Connecting from the same node ID (using the same secret) can take an extremely long time. The first run usually takes < 3 seconds, while a subsequent run can take > 10 seconds.
    • With this branch, run: cargo test --features="discovery-local-network" -- --nocapture --test can_connect_from_same_id

Sample output

2025-06-16T22:02:49.372917Z TRACE can_connect_from_same_id:connect{me="3b6a27bcce" alpn="n0/iroh/test" remote="378fc07264"}:drive{id=0}: iroh_quinn_proto::connection: timeout timer=Close
2025-06-16T22:02:49.373051Z TRACE can_connect_from_same_id:ep{me=3b6a27bcce}:magicsock:actor: iroh::magicsock: tick: msg msg=Shutdown
2025-06-16T22:02:49.373071Z DEBUG can_connect_from_same_id:ep{me=3b6a27bcce}:magicsock:actor: iroh::magicsock: shutting down
2025-06-16T22:02:49.373112Z  INFO can_connect_from_same_id:ep{me=3b6a27bcce}:magicsock:actor:reset{node=378fc07264}: iroh::magicsock::node_map::best_addr: clearing best_addr reason=Reset has_relay=false old_addr=[omitted]:34489
2025-06-16T22:02:49.373133Z DEBUG can_connect_from_same_id:ep{me=3b6a27bcce}:magicsock:actor: iroh::magicsock: shutdown complete
2025-06-16T22:02:49.473882Z  WARN can_connect_from_same_id:close{me=3b6a27bcce}: iroh::magicsock: tasks didn't finish in time, aborting remaining 1/3 tasks
2025-06-16T22:02:49.474052Z TRACE can_connect_from_same_id:close{me=3b6a27bcce}: iroh::magicsock: magicsock closed
Elapsed time for connection 0: 1.206381743s
Elapsed time for connection 1: 22.193293074s
Elapsed time for connection 2: 6.21092495s
test endpoint::tests::can_connect_from_same_id ... ok

Breaking Changes

Notes & open questions

Change checklist

  • Self-review.
  • Documentation updates following the style guide, if relevant.
  • Tests if relevant.
  • All breaking changes documented.
    • List all breaking changes in the above "Breaking Changes" section.
    • Open an issue or PR on any number0 repos that are affected by this breaking change. Give guidance on how the updates should be handled or do the actual updates themselves. The major ones are:

Comment on lines 3112 to 3119
elapsed_times.iter().enumerate().for_each(|(i, elapsed)| {
    println!("Elapsed time for connection {i}: {elapsed:?}");
});

// assert!(elapsed1 < Duration::from_secs(5));
// assert!(elapsed2 < Duration::from_secs(5));
// check that elapsed2 completed within 1 second of elapsed1
// assert!(elapsed2 <= elapsed1 + Duration::from_secs(1));
Author

I'm trying to figure out a way to properly "test" that the subsequent elapsed times are way too long to be normal and are indicative of an error.

Contributor

Writing this in a way that won't be flaky is tricky for sure. Maybe you could take the minimum time as a baseline and not allow any of the times to exceed 3 times that baseline? From the times you have posted that seems reasonable.
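A minimal sketch of that suggestion, using only std types (the helper name, the 3x factor, and the `elapsed_times` variable are taken from this thread or assumed, not from the PR's code):

```rust
use std::time::Duration;

/// Assert that no connection took more than `factor` times the fastest one.
/// Using the minimum as the baseline keeps the check independent of absolute
/// machine speed, which should make it less flaky in CI.
fn assert_within_factor_of_fastest(elapsed_times: &[Duration], factor: u32) {
    let baseline = *elapsed_times
        .iter()
        .min()
        .expect("at least one connection was measured");
    for (i, elapsed) in elapsed_times.iter().enumerate() {
        assert!(
            *elapsed <= baseline * factor,
            "connection {i} took {elapsed:?}, more than {factor}x the baseline {baseline:?}"
        );
    }
}
```

With the sample numbers in this PR (fastest around 1.2s, slowest around 22s) a call like `assert_within_factor_of_fastest(&elapsed_times, 3)` would fail as intended, while the all-fast runs stay well under the bound.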

@n0bot n0bot bot added this to iroh Jun 16, 2025
@github-project-automation github-project-automation bot moved this to 🏗 In progress in iroh Jun 16, 2025
@flub (Contributor) left a comment

Thanks for this, looks good!


let server = Endpoint::builder()
    .alpns(vec![TEST_ALPN.to_vec()])
    .discovery_local_network()
Contributor

I wonder if this test really needs a discovery service. Would it not work anyway, since both endpoints are on the same network, so the NodeAddr should contain a usable IP addr?

Author

Hmm, interestingly enough it seems like this issue may be related to relay_mode rather than to local discovery.

When I remove local discovery the issue still persists, so you are right on that front.

But when the relay_mode option is not supplied at all, connections do not take a long time, so setting it to Disabled is related to this issue. Good call out, this narrows it down a bit more.

Also, I wonder if that means a "good" test here would need to cover this scenario with many endpoint configurations, as there could be a regression only in configurations with or without certain features enabled.


@RingOfStorms (Author)

Some more findings:

  • Local discovery doesn't seem to be the thing that affects this behavior right now.
  • Relay mode DISABLED seems to be causing the issues.
    • It is worse when the server side has relay mode disabled.
    • I did a test with the relay_mode option supplied for each permutation of the server/client combo, and all 4 states have quite different timings (a rough sketch of the four configurations follows below, before the sample runs).

Based on the sample runs below I can theorize two things:

  • When the server side has relay mode disabled, something causes really long subsequent connections.
  • Interestingly, when relay mode is disabled on the client only, it actually speeds things up compared to not supplying it to either side. This seems surprising to me, but I don't know how this setting affects the actual internals, so maybe this is expected.
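A hypothetical fragment showing how the four configurations could be toggled, using only the builder calls already seen in this PR (`TEST_ALPN` and the reused client `secret` are the names from the PR's test; binding, the echo round-trips, and the timing are elided):

```rust
// Toggle RelayMode::Disabled independently on the server and client builders
// to produce the four permutations whose timings are listed below.
for (disable_server_relay, disable_client_relay) in
    [(false, false), (true, false), (false, true), (true, true)]
{
    let mut server = Endpoint::builder().alpns(vec![TEST_ALPN.to_vec()]);
    if disable_server_relay {
        server = server.relay_mode(RelayMode::Disabled);
    }

    let mut client = Endpoint::builder().secret_key(secret.clone());
    if disable_client_relay {
        client = client.relay_mode(RelayMode::Disabled);
    }

    // ... bind both endpoints, run 10 echo connections, record elapsed times ...
}
```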

relay_mode option not supplied, 10 echos

Elapsed time for connection 0: 2.621278986s
Elapsed time for connection 1: 2.755411581s
Elapsed time for connection 2: 2.876377633s
Elapsed time for connection 3: 2.707872225s
Elapsed time for connection 4: 2.799982384s
Elapsed time for connection 5: 2.92761123s
Elapsed time for connection 6: 2.919052842s
Elapsed time for connection 7: 2.772311601s
Elapsed time for connection 8: 2.722433186s
Elapsed time for connection 9: 2.81553961s

.relay_mode(RelayMode::Disabled) on server side only, 10 echos

Elapsed time for connection 0: 2.463296988s
Elapsed time for connection 1: 6.206893887s
Elapsed time for connection 2: 22.188036821s
Elapsed time for connection 3: 22.186350997s
Elapsed time for connection 4: 6.207294379s
Elapsed time for connection 5: 22.193584083s
Elapsed time for connection 6: 22.188255523s
Elapsed time for connection 7: 6.2039095s
Elapsed time for connection 8: 6.204400337s
Elapsed time for connection 9: 22.188778425s

.relay_mode(RelayMode::Disabled) on client side only, 10 echos

Elapsed time for connection 0: 480.599987ms
Elapsed time for connection 1: 1.420993144s
Elapsed time for connection 2: 1.41849239s
Elapsed time for connection 3: 1.424476043s
Elapsed time for connection 4: 1.424771492s
Elapsed time for connection 5: 1.429121979s
Elapsed time for connection 6: 1.430509926s
Elapsed time for connection 7: 1.41904525s
Elapsed time for connection 8: 1.438335429s
Elapsed time for connection 9: 1.446379564s

.relay_mode(RelayMode::Disabled) on both, 10 echos

Elapsed time for connection 0: 1.202893454s
Elapsed time for connection 1: 22.194231071s
Elapsed time for connection 2: 6.205596609s
Elapsed time for connection 3: 22.189599189s
Elapsed time for connection 4: 6.205128527s
Elapsed time for connection 5: 22.189607762s
Elapsed time for connection 6: 22.190115474s
Elapsed time for connection 7: 6.206592532s
Elapsed time for connection 8: 22.190172041s
Elapsed time for connection 9: 21.193976603s


RingOfStorms commented Jun 17, 2025

Also, to further confirm that this only applies when the client re-uses the same secret key, I ran the above with no secret_key option on the client. While they all completed much faster than above, there are still similar-looking differences between the runs. So maybe the secret key is further exacerbating the same underlying issue? I am not sure.
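As a side note, this hedged sketch illustrates what re-using the secret key means for the client's identity (the exact `SecretKey::generate` signature and builder details vary across iroh versions, so treat this as an assumption rather than the PR's code):

```rust
use iroh::{Endpoint, SecretKey};

#[tokio::test]
async fn same_secret_means_same_node_id() {
    // Two client endpoints built from the same SecretKey present the same
    // NodeId to the server, which is the scenario this PR's test exercises.
    let secret = SecretKey::generate(rand::rngs::OsRng);

    let client_a = Endpoint::builder()
        .secret_key(secret.clone())
        .bind()
        .await
        .expect("bind client_a");
    let client_b = Endpoint::builder()
        .secret_key(secret.clone())
        .bind()
        .await
        .expect("bind client_b");
    assert_eq!(client_a.node_id(), client_b.node_id());

    // Without .secret_key(..) every endpoint gets a fresh random identity,
    // which is the "no secret_key option on the client" variant above.
    let fresh = Endpoint::builder().bind().await.expect("bind fresh");
    assert_ne!(fresh.node_id(), client_a.node_id());
}
```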

relay_mode option not supplied, 10 echos

Elapsed time for connection 0: 3.52039683s
Elapsed time for connection 1: 3.520287474s
Elapsed time for connection 2: 3.558214873s
Elapsed time for connection 3: 2.452368535s
Elapsed time for connection 4: 2.524942325s
Elapsed time for connection 5: 2.562164633s
Elapsed time for connection 6: 2.683931056s
Elapsed time for connection 7: 3.545730718s
Elapsed time for connection 8: 2.469515744s
Elapsed time for connection 9: 2.533668437s

.relay_mode(RelayMode::Disabled) on server side only, 10 echos

Elapsed time for connection 0: 2.431508161s
Elapsed time for connection 1: 2.491735194s
Elapsed time for connection 2: 2.454442741s
Elapsed time for connection 3: 2.472972864s
Elapsed time for connection 4: 2.492294508s
Elapsed time for connection 5: 2.528253839s
Elapsed time for connection 6: 2.477290178s
Elapsed time for connection 7: 2.51136393s
Elapsed time for connection 8: 2.461494198s
Elapsed time for connection 9: 2.482727827s

.relay_mode(RelayMode::Disabled) on client side only, 10 echos
Strangely, in this case the first connection is very long and the rest are very short, with some intermittent long connections; this would fail the proposed 3x test assertion and may have a different cause...
edit: I have done more runs of 10 echos with this configuration and I've hit the long multi-second connection in the middle, or often not at all; sometimes it is just fine and all runs look the same. Maybe this is a fluke?

Another run of relay disabled on client side only with 50 echos

Elapsed time for connection 0: 454.484329ms
Elapsed time for connection 1: 3.439202428s
Elapsed time for connection 2: 483.412063ms
Elapsed time for connection 3: 453.793292ms
Elapsed time for connection 4: 463.042489ms
Elapsed time for connection 5: 457.020213ms
Elapsed time for connection 6: 469.528637ms
Elapsed time for connection 7: 486.915916ms
Elapsed time for connection 8: 488.905951ms
Elapsed time for connection 9: 463.121055ms
Elapsed time for connection 10: 464.761007ms
Elapsed time for connection 11: 451.984235ms
Elapsed time for connection 12: 452.732974ms
Elapsed time for connection 13: 478.018431ms
Elapsed time for connection 14: 483.07336ms
Elapsed time for connection 15: 459.111452ms
Elapsed time for connection 16: 489.106585ms
Elapsed time for connection 17: 467.113396ms
Elapsed time for connection 18: 462.644819ms
Elapsed time for connection 19: 479.456706ms
Elapsed time for connection 20: 450.815345ms
Elapsed time for connection 21: 471.021132ms
Elapsed time for connection 22: 456.06071ms
Elapsed time for connection 23: 467.25306ms
Elapsed time for connection 24: 457.837728ms
Elapsed time for connection 25: 478.9684ms
Elapsed time for connection 26: 3.431005028s //// NOTE
Elapsed time for connection 27: 484.92771ms
Elapsed time for connection 28: 466.864109ms
Elapsed time for connection 29: 476.464074ms
Elapsed time for connection 30: 481.843003ms
Elapsed time for connection 31: 490.781304ms
Elapsed time for connection 32: 488.21522ms
Elapsed time for connection 33: 485.857017ms
Elapsed time for connection 34: 454.482103ms
Elapsed time for connection 35: 478.948288ms
Elapsed time for connection 36: 468.59286ms
Elapsed time for connection 37: 490.488251ms
Elapsed time for connection 38: 462.908715ms
Elapsed time for connection 39: 492.200869ms
Elapsed time for connection 40: 486.110774ms
Elapsed time for connection 41: 531.070594ms
Elapsed time for connection 42: 489.180323ms
Elapsed time for connection 43: 450.95618ms
Elapsed time for connection 44: 457.455389ms
Elapsed time for connection 45: 484.012553ms
Elapsed time for connection 46: 447.19992ms
Elapsed time for connection 47: 456.92218ms
Elapsed time for connection 48: 449.062418ms
Elapsed time for connection 49: 473.720509ms

Elapsed time for connection 0: 4.293278282s /// NOTE
Elapsed time for connection 1: 475.538915ms
Elapsed time for connection 2: 452.222589ms
Elapsed time for connection 3: 460.007996ms
Elapsed time for connection 4: 478.678748ms
Elapsed time for connection 5: 905.055489ms
Elapsed time for connection 6: 464.185802ms
Elapsed time for connection 7: 3.446739172s
Elapsed time for connection 8: 449.412699ms
Elapsed time for connection 9: 467.608995ms

.relay_mode(RelayMode::Disabled) on both, 10 echos

Elapsed time for connection 0: 1.203015678s
Elapsed time for connection 1: 1.206038073s
Elapsed time for connection 2: 1.209485912s
Elapsed time for connection 3: 1.210431619s
Elapsed time for connection 4: 1.214886733s
Elapsed time for connection 5: 1.211535852s
Elapsed time for connection 6: 1.204868818s
Elapsed time for connection 7: 1.211356034s
Elapsed time for connection 8: 1.206658554s
Elapsed time for connection 9: 1.205175969s


flub commented Jun 18, 2025

The problem is with the state that the server keeps: the client migrates to a new address but the server keeps trying to reach it on the old one. This is not a problem when the server is reachable on the relay server: the client will contact the server via the relay if it gets no response on the direct path (or it will race the two or something, I forgot the details but racing is more likely). It will then initiate holepunching and by doing so the server learns about the new addresses of the client that can be used, and the direct connection will succeed (even though holepunching is not technically needed here, it is only used to learn about the direct addresses).

Enabling the relay on the client side has no effect because it is never used: the client connects to the server so only the relay of the server is ever used.

For the test I suggest you disable relay servers on both sides (no need to contact a relay server or spin up a test one) and do the latency check on establishing a connection. That's enough to show the bug.
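A rough sketch of that suggested test shape, with relays disabled on both sides and the latency check applied only to connection establishment (the ALPN value comes from the sample logs above; obtaining the server's NodeAddr is elided because the accessor differs across iroh versions, and everything else is an assumption about the final shape of the test):

```rust
use std::time::Instant;

use iroh::{Endpoint, NodeAddr, RelayMode, SecretKey};

// ALPN as seen in the trace output earlier in this PR.
const TEST_ALPN: &[u8] = b"n0/iroh/test";

#[tokio::test]
async fn can_connect_from_same_id() {
    let server = Endpoint::builder()
        .alpns(vec![TEST_ALPN.to_vec()])
        .relay_mode(RelayMode::Disabled)
        .bind()
        .await
        .expect("bind server");
    // How to obtain the server's NodeAddr (node id plus direct addresses)
    // differs across iroh versions, so it is left open here.
    let server_addr: NodeAddr = todo!("NodeAddr of the server endpoint");

    // Re-use one secret key so every client connects from the same NodeId.
    let secret = SecretKey::generate(rand::rngs::OsRng);
    let mut elapsed_times = Vec::new();
    for _ in 0..10 {
        let client = Endpoint::builder()
            .secret_key(secret.clone())
            .relay_mode(RelayMode::Disabled)
            .bind()
            .await
            .expect("bind client");
        let start = Instant::now();
        let conn = client
            .connect(server_addr.clone(), TEST_ALPN)
            .await
            .expect("connect");
        elapsed_times.push(start.elapsed());
        conn.close(0u32.into(), b"done");
        client.close().await;
    }
    // ... apply the baseline-times-3 check sketched earlier in this thread ...
    drop(server);
}
```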

@flub (Contributor) left a comment

If we tidy things up we can probably merge this. Let's make the test name start with expected_failure, mark it with #[ignore = "issue #xyz"], and file an issue describing that the server keeps sending to the wrong address for too long, more or less what I say in the other comment. Then this should be more or less ready for merging, I think.
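A minimal sketch of that renaming and attribute ("#xyz" is the placeholder from the comment above and stays unfilled until the issue exists; the tokio test attribute is an assumption about how the existing test is declared):

```rust
#[tokio::test]
#[ignore = "issue #xyz: the server keeps sending to the client's old address for too long"]
async fn expected_failure_can_connect_from_same_id() {
    // ... the connection-latency checks from this PR ...
}
```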

Contributor

Can we remove things here again? This is an example, not a debugging tool.

The warning can be checked for in a test. It is about some internal cleanup, nothing on the application layer is misbehaving.

conn.close(42u32.into(), b"thanks, bye!");
// TODO this causes a warn that things are not cleaned up gracefully, how can we fail a
// test due to that?
Contributor

As said on the example: this is about some internal cleanup, nothing that misbehaves for users. You can track this in a separate test that checks there are no warnings logged for a normal connection close.
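One possible shape for such a test, as a hedged sketch: this uses the tracing-test crate, which is an assumption (iroh's test suite may have its own log-capture helpers), and its default log filter may need adjusting to capture iroh's internal WARN output:

```rust
use tracing_test::traced_test;

#[tokio::test]
#[traced_test]
async fn graceful_close_emits_no_warnings() {
    // ... open a connection, conn.close(42u32.into(), b"thanks, bye!"),
    //     then endpoint.close().await, as in the example ...

    // `logs_contain` is provided by #[traced_test]; the string matches the
    // warning shown in the sample output at the top of this PR.
    assert!(
        !logs_contain("tasks didn't finish in time"),
        "graceful close should not have to abort leftover tasks"
    );
}
```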


flub commented Jun 27, 2025

Hey, this is being worked on in #3372, which can probably replace this as it also contains a (so far flaky...) test.
