
A test for weird long connection behavior from same Node IDs #3350


Draft · wants to merge 2 commits into main

Conversation


@RingOfStorms commented Jun 16, 2025

Description

Regarding this Discord halp thread.

Trying to figure out how to prove there are weird connection bugs in iroh.

Specifically:

  • Graceful close is not graceful; note that when logging is on you will see a WARN about failing to clean up tasks on any of the graceful examples.
    • Run cargo test -- --nocapture --test graceful_close to see these logs. This is seen on the main branch of iroh as of writing.

Sample output

2025-06-16T21:05:34.991992Z DEBUG graceful_close:ep{me=ac83794a87}:magicsock:actor: iroh::magicsock: shutdown complete
2025-06-16T21:05:35.093639Z  WARN graceful_close:close{me=ac83794a87}: iroh::magicsock: tasks didn't finish in time, aborting remaining 1/3 tasks
2025-06-16T21:05:35.093786Z TRACE graceful_close:close{me=ac83794a87}: iroh::magicsock: magicsock closed
test endpoint::tests::graceful_close ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 78 filtered out; finished in 4.75s
  • Connecting from the same node ID (using the same secret) can take an extremely long time. The first run usually takes < 3 seconds, while a subsequent run can take > 10 seconds.
    • With this branch, run: cargo test --features="discovery-local-network" -- --nocapture --test can_connect_from_same_id

Sample output

2025-06-16T22:02:49.372917Z TRACE can_connect_from_same_id:connect{me="3b6a27bcce" alpn="n0/iroh/test" remote="378fc07264"}:drive{id=0}: iroh_quinn_proto::connection: timeout timer=Close
2025-06-16T22:02:49.373051Z TRACE can_connect_from_same_id:ep{me=3b6a27bcce}:magicsock:actor: iroh::magicsock: tick: msg msg=Shutdown
2025-06-16T22:02:49.373071Z DEBUG can_connect_from_same_id:ep{me=3b6a27bcce}:magicsock:actor: iroh::magicsock: shutting down
2025-06-16T22:02:49.373112Z  INFO can_connect_from_same_id:ep{me=3b6a27bcce}:magicsock:actor:reset{node=378fc07264}: iroh::magicsock::node_map::best_addr: clearing best_addr reason=Reset has_relay=false old_addr=[omitted]:34489
2025-06-16T22:02:49.373133Z DEBUG can_connect_from_same_id:ep{me=3b6a27bcce}:magicsock:actor: iroh::magicsock: shutdown complete
2025-06-16T22:02:49.473882Z  WARN can_connect_from_same_id:close{me=3b6a27bcce}: iroh::magicsock: tasks didn't finish in time, aborting remaining 1/3 tasks
2025-06-16T22:02:49.474052Z TRACE can_connect_from_same_id:close{me=3b6a27bcce}: iroh::magicsock: magicsock closed
Elapsed time for connection 0: 1.206381743s
Elapsed time for connection 1: 22.193293074s
Elapsed time for connection 2: 6.21092495s
test endpoint::tests::can_connect_from_same_id ... ok

Breaking Changes

Notes & open questions

Change checklist

  • Self-review.
  • Documentation updates following the style guide, if relevant.
  • Tests if relevant.
  • All breaking changes documented.
    • List all breaking changes in the above "Breaking Changes" section.
    • Open an issue or PR on any number0 repos that are affected by this breaking change. Give guidance on how the updates should be handled or do the actual updates themselves. The major ones are:

Comment on lines 3112 to 3119
elapsed_times.iter().enumerate().for_each(|(i, elapsed)| {
    println!("Elapsed time for connection {i}: {elapsed:?}");
});

// assert!(elapsed1 < Duration::from_secs(5));
// assert!(elapsed2 < Duration::from_secs(5));
// check that elapsed2 completed within 1 second of elapsed1
// assert!(elapsed2 <= elapsed1 + Duration::from_secs(1));
Author

I'm trying to figure out a way to properly "test" that the subsequent elapsed times are way too long to be normal and are indicative of an error.

Contributor

Writing this in a way that won't be flaky is tricky for sure. Maybe you could take the minimum time as a baseline and not allow any of the times to exceed 3 times that baseline? From the times you have posted that seems reasonable.
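A minimal sketch of that suggestion, using only std types (the helper name, the 3x factor, and the `elapsed_times` variable are taken from this thread or assumed, not from the PR's code):

```rust
use std::time::Duration;

/// Assert that no connection took more than `factor` times the fastest one.
/// Using the minimum as the baseline keeps the check independent of absolute
/// machine speed, which should make it less flaky in CI.
fn assert_within_factor_of_fastest(elapsed_times: &[Duration], factor: u32) {
    let baseline = *elapsed_times
        .iter()
        .min()
        .expect("at least one connection was measured");
    for (i, elapsed) in elapsed_times.iter().enumerate() {
        assert!(
            *elapsed <= baseline * factor,
            "connection {i} took {elapsed:?}, more than {factor}x the baseline {baseline:?}"
        );
    }
}
```

With the sample numbers in this PR (fastest around 1.2s, slowest around 22s) a call like `assert_within_factor_of_fastest(&elapsed_times, 3)` would fail as intended, while the all-fast runs stay well under the bound.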

@n0bot n0bot bot added this to iroh Jun 16, 2025
@github-project-automation github-project-automation bot moved this to 🏗 In progress in iroh Jun 16, 2025
@flub (Contributor) left a comment

Thanks for this, looks good!


let server = Endpoint::builder()
    .alpns(vec![TEST_ALPN.to_vec()])
    .discovery_local_network()
Contributor

I wonder if this test really needs a discovery service. Would it not work anyway, since both endpoints are on the same network, so the NodeAddr should contain a usable IP addr?

Author

Hmm, interestingly enough it seems like this issue may be related to relay_mode rather than to local discovery.

When I remove local discovery the issue still persists, so you are right on that front.

But when the relay_mode option is not supplied at all, connections do not take a long time, so setting it to Disabled is related to this issue. Good call out, this narrows it down a bit more.

Also, I wonder if that means a "good" test here would need to cover this scenario with many endpoint configurations, as there could be a regression only in configurations with or without certain features enabled.


@RingOfStorms (Author)

Some more findings:

  • Local discovery doesn't seem to be the thing that affects this behavior right now.
  • Relay mode DISABLED seems to be causing the issues.
    • It is worse when the server side has relay mode disabled.
    • I did a test with the relay_mode option supplied for each permutation of the server/client combo, and all 4 states have quite different timings (a rough sketch of the four configurations follows below, before the sample runs).

Based on the sample runs below I can theorize two things:

  • When the server side has relay mode disabled, something causes really long subsequent connections.
  • Interestingly, when relay mode is disabled on the client only, it actually speeds things up compared to not supplying it to either side. This seems surprising to me, but I don't know how this setting affects the actual internals, so maybe this is expected.
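A hypothetical fragment showing how the four configurations could be toggled, using only the builder calls already seen in this PR (`TEST_ALPN` and the reused client `secret` are the names from the PR's test; binding, the echo round-trips, and the timing are elided):

```rust
// Toggle RelayMode::Disabled independently on the server and client builders
// to produce the four permutations whose timings are listed below.
for (disable_server_relay, disable_client_relay) in
    [(false, false), (true, false), (false, true), (true, true)]
{
    let mut server = Endpoint::builder().alpns(vec![TEST_ALPN.to_vec()]);
    if disable_server_relay {
        server = server.relay_mode(RelayMode::Disabled);
    }

    let mut client = Endpoint::builder().secret_key(secret.clone());
    if disable_client_relay {
        client = client.relay_mode(RelayMode::Disabled);
    }

    // ... bind both endpoints, run 10 echo connections, record elapsed times ...
}
```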

relay_mode option not supplied, 10 echos

Elapsed time for connection 0: 2.621278986s
Elapsed time for connection 1: 2.755411581s
Elapsed time for connection 2: 2.876377633s
Elapsed time for connection 3: 2.707872225s
Elapsed time for connection 4: 2.799982384s
Elapsed time for connection 5: 2.92761123s
Elapsed time for connection 6: 2.919052842s
Elapsed time for connection 7: 2.772311601s
Elapsed time for connection 8: 2.722433186s
Elapsed time for connection 9: 2.81553961s

.relay_mode(RelayMode::Disabled) on server side only, 10 echos

Elapsed time for connection 0: 2.463296988s
Elapsed time for connection 1: 6.206893887s
Elapsed time for connection 2: 22.188036821s
Elapsed time for connection 3: 22.186350997s
Elapsed time for connection 4: 6.207294379s
Elapsed time for connection 5: 22.193584083s
Elapsed time for connection 6: 22.188255523s
Elapsed time for connection 7: 6.2039095s
Elapsed time for connection 8: 6.204400337s
Elapsed time for connection 9: 22.188778425s

.relay_mode(RelayMode::Disabled) on client side only, 10 echos

Elapsed time for connection 0: 480.599987ms
Elapsed time for connection 1: 1.420993144s
Elapsed time for connection 2: 1.41849239s
Elapsed time for connection 3: 1.424476043s
Elapsed time for connection 4: 1.424771492s
Elapsed time for connection 5: 1.429121979s
Elapsed time for connection 6: 1.430509926s
Elapsed time for connection 7: 1.41904525s
Elapsed time for connection 8: 1.438335429s
Elapsed time for connection 9: 1.446379564s

.relay_mode(RelayMode::Disabled) on both, 10 echos

Elapsed time for connection 0: 1.202893454s
Elapsed time for connection 1: 22.194231071s
Elapsed time for connection 2: 6.205596609s
Elapsed time for connection 3: 22.189599189s
Elapsed time for connection 4: 6.205128527s
Elapsed time for connection 5: 22.189607762s
Elapsed time for connection 6: 22.190115474s
Elapsed time for connection 7: 6.206592532s
Elapsed time for connection 8: 22.190172041s
Elapsed time for connection 9: 21.193976603s


RingOfStorms commented Jun 17, 2025

Also, to further confirm that this only applies when the client re-uses the same secret key, I ran the above with no secret_key option on the client. While they all completed much faster than above, there are still similar-looking differences between the runs. So maybe the secret key is further exacerbating the same underlying issue? I am not sure.
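As a side note, this hedged sketch illustrates what re-using the secret key means for the client's identity (the exact `SecretKey::generate` signature and builder details vary across iroh versions, so treat this as an assumption rather than the PR's code):

```rust
use iroh::{Endpoint, SecretKey};

#[tokio::test]
async fn same_secret_means_same_node_id() {
    // Two client endpoints built from the same SecretKey present the same
    // NodeId to the server, which is the scenario this PR's test exercises.
    let secret = SecretKey::generate(rand::rngs::OsRng);

    let client_a = Endpoint::builder()
        .secret_key(secret.clone())
        .bind()
        .await
        .expect("bind client_a");
    let client_b = Endpoint::builder()
        .secret_key(secret.clone())
        .bind()
        .await
        .expect("bind client_b");
    assert_eq!(client_a.node_id(), client_b.node_id());

    // Without .secret_key(..) every endpoint gets a fresh random identity,
    // which is the "no secret_key option on the client" variant above.
    let fresh = Endpoint::builder().bind().await.expect("bind fresh");
    assert_ne!(fresh.node_id(), client_a.node_id());
}
```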

relay_mode option not supplied, 10 echos

Elapsed time for connection 0: 3.52039683s
Elapsed time for connection 1: 3.520287474s
Elapsed time for connection 2: 3.558214873s
Elapsed time for connection 3: 2.452368535s
Elapsed time for connection 4: 2.524942325s
Elapsed time for connection 5: 2.562164633s
Elapsed time for connection 6: 2.683931056s
Elapsed time for connection 7: 3.545730718s
Elapsed time for connection 8: 2.469515744s
Elapsed time for connection 9: 2.533668437s

.relay_mode(RelayMode::Disabled) on server side only, 10 echos

Elapsed time for connection 0: 2.431508161s
Elapsed time for connection 1: 2.491735194s
Elapsed time for connection 2: 2.454442741s
Elapsed time for connection 3: 2.472972864s
Elapsed time for connection 4: 2.492294508s
Elapsed time for connection 5: 2.528253839s
Elapsed time for connection 6: 2.477290178s
Elapsed time for connection 7: 2.51136393s
Elapsed time for connection 8: 2.461494198s
Elapsed time for connection 9: 2.482727827s

.relay_mode(RelayMode::Disabled) on client side only, 10 echos
Strangely, in this case the first connection is very long and the rest are very short, with some intermittent long connections; this would fail the proposed 3x test assertion and may have a different cause...
edit: I have done more runs of 10 echos with this configuration and I've hit the long multi-second connection in the middle, or often not at all; sometimes it is just fine and all runs look the same. Maybe this is a fluke?

Another run of relay disabled on client side only with 50 echos

Elapsed time for connection 0: 454.484329ms
Elapsed time for connection 1: 3.439202428s
Elapsed time for connection 2: 483.412063ms
Elapsed time for connection 3: 453.793292ms
Elapsed time for connection 4: 463.042489ms
Elapsed time for connection 5: 457.020213ms
Elapsed time for connection 6: 469.528637ms
Elapsed time for connection 7: 486.915916ms
Elapsed time for connection 8: 488.905951ms
Elapsed time for connection 9: 463.121055ms
Elapsed time for connection 10: 464.761007ms
Elapsed time for connection 11: 451.984235ms
Elapsed time for connection 12: 452.732974ms
Elapsed time for connection 13: 478.018431ms
Elapsed time for connection 14: 483.07336ms
Elapsed time for connection 15: 459.111452ms
Elapsed time for connection 16: 489.106585ms
Elapsed time for connection 17: 467.113396ms
Elapsed time for connection 18: 462.644819ms
Elapsed time for connection 19: 479.456706ms
Elapsed time for connection 20: 450.815345ms
Elapsed time for connection 21: 471.021132ms
Elapsed time for connection 22: 456.06071ms
Elapsed time for connection 23: 467.25306ms
Elapsed time for connection 24: 457.837728ms
Elapsed time for connection 25: 478.9684ms
Elapsed time for connection 26: 3.431005028s //// NOTE
Elapsed time for connection 27: 484.92771ms
Elapsed time for connection 28: 466.864109ms
Elapsed time for connection 29: 476.464074ms
Elapsed time for connection 30: 481.843003ms
Elapsed time for connection 31: 490.781304ms
Elapsed time for connection 32: 488.21522ms
Elapsed time for connection 33: 485.857017ms
Elapsed time for connection 34: 454.482103ms
Elapsed time for connection 35: 478.948288ms
Elapsed time for connection 36: 468.59286ms
Elapsed time for connection 37: 490.488251ms
Elapsed time for connection 38: 462.908715ms
Elapsed time for connection 39: 492.200869ms
Elapsed time for connection 40: 486.110774ms
Elapsed time for connection 41: 531.070594ms
Elapsed time for connection 42: 489.180323ms
Elapsed time for connection 43: 450.95618ms
Elapsed time for connection 44: 457.455389ms
Elapsed time for connection 45: 484.012553ms
Elapsed time for connection 46: 447.19992ms
Elapsed time for connection 47: 456.92218ms
Elapsed time for connection 48: 449.062418ms
Elapsed time for connection 49: 473.720509ms

Elapsed time for connection 0: 4.293278282s /// NOTE
Elapsed time for connection 1: 475.538915ms
Elapsed time for connection 2: 452.222589ms
Elapsed time for connection 3: 460.007996ms
Elapsed time for connection 4: 478.678748ms
Elapsed time for connection 5: 905.055489ms
Elapsed time for connection 6: 464.185802ms
Elapsed time for connection 7: 3.446739172s
Elapsed time for connection 8: 449.412699ms
Elapsed time for connection 9: 467.608995ms

.relay_mode(RelayMode::Disabled) on both, 10 echos

Elapsed time for connection 0: 1.203015678s
Elapsed time for connection 1: 1.206038073s
Elapsed time for connection 2: 1.209485912s
Elapsed time for connection 3: 1.210431619s
Elapsed time for connection 4: 1.214886733s
Elapsed time for connection 5: 1.211535852s
Elapsed time for connection 6: 1.204868818s
Elapsed time for connection 7: 1.211356034s
Elapsed time for connection 8: 1.206658554s
Elapsed time for connection 9: 1.205175969s


flub commented Jun 18, 2025

The problem is with the state that the server keeps: the client migrates to a new address but the server keeps trying to reach it on the old one. This is not a problem when the server is reachable on the relay server: the client will contact the server via the relay if it gets no response on the direct path (or it will race the two or something, I forgot the details but racing is more likely). It will then initiate holepunching and by doing so the server learns about the new addresses of the client that can be used, and the direct connection will succeed (even though holepunching is not technically needed here, it is only used to learn about the direct addresses).

Enabling the relay on the client side has no effect because it is never used: the client connects to the server so only the relay of the server is ever used.

For the test I suggest you disable relay servers on both sides (no need to contact a relay server or spin up a test one) and do the latency check on establishing a connection. That's enough to show the bug.
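A rough sketch of that suggested test shape, with relays disabled on both sides and the latency check applied only to connection establishment (the ALPN value comes from the sample logs above; obtaining the server's NodeAddr is elided because the accessor differs across iroh versions, and everything else is an assumption about the final shape of the test):

```rust
use std::time::Instant;

use iroh::{Endpoint, NodeAddr, RelayMode, SecretKey};

// ALPN as seen in the trace output earlier in this PR.
const TEST_ALPN: &[u8] = b"n0/iroh/test";

#[tokio::test]
async fn can_connect_from_same_id() {
    let server = Endpoint::builder()
        .alpns(vec![TEST_ALPN.to_vec()])
        .relay_mode(RelayMode::Disabled)
        .bind()
        .await
        .expect("bind server");
    // How to obtain the server's NodeAddr (node id plus direct addresses)
    // differs across iroh versions, so it is left open here.
    let server_addr: NodeAddr = todo!("NodeAddr of the server endpoint");

    // Re-use one secret key so every client connects from the same NodeId.
    let secret = SecretKey::generate(rand::rngs::OsRng);
    let mut elapsed_times = Vec::new();
    for _ in 0..10 {
        let client = Endpoint::builder()
            .secret_key(secret.clone())
            .relay_mode(RelayMode::Disabled)
            .bind()
            .await
            .expect("bind client");
        let start = Instant::now();
        let conn = client
            .connect(server_addr.clone(), TEST_ALPN)
            .await
            .expect("connect");
        elapsed_times.push(start.elapsed());
        conn.close(0u32.into(), b"done");
        client.close().await;
    }
    // ... apply the baseline-times-3 check sketched earlier in this thread ...
    drop(server);
}
```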

@flub (Contributor) left a comment

If we tidy things up we can probably merge this. Let's make the test name start with expected_failure, mark it with #[ignore = "issue #xyz"], and file an issue describing that the server keeps sending to the wrong address for too long, more or less what I say in the other comment. Then this should be more or less ready for merging, I think.
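A minimal sketch of that renaming and attribute ("#xyz" is the placeholder from the comment above and stays unfilled until the issue exists; the tokio test attribute is an assumption about how the existing test is declared):

```rust
#[tokio::test]
#[ignore = "issue #xyz: the server keeps sending to the client's old address for too long"]
async fn expected_failure_can_connect_from_same_id() {
    // ... the connection-latency checks from this PR ...
}
```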

Contributor

Can we remove things here again? This is an example, not a debugging tool.

The warning can be checked for in a test. It is about some internal cleanup, nothing on the application layer is misbehaving.

conn.close(42u32.into(), b"thanks, bye!");
// TODO this causes a warn that things are not cleaned up gracefully, how can we fail a
// test due to that?
Contributor

As said on the example: this is about some internal cleanup, nothing that misbehaves for users. You can track this in a separate test that checks there are no warnings logged for a normal connection close.
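One possible shape for such a test, as a hedged sketch: this uses the tracing-test crate, which is an assumption (iroh's test suite may have its own log-capture helpers), and its default log filter may need adjusting to capture iroh's internal WARN output:

```rust
use tracing_test::traced_test;

#[tokio::test]
#[traced_test]
async fn graceful_close_emits_no_warnings() {
    // ... open a connection, conn.close(42u32.into(), b"thanks, bye!"),
    //     then endpoint.close().await, as in the example ...

    // `logs_contain` is provided by #[traced_test]; the string matches the
    // warning shown in the sample output at the top of this PR.
    assert!(
        !logs_contain("tasks didn't finish in time"),
        "graceful close should not have to abort leftover tasks"
    );
}
```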


flub commented Jun 27, 2025

Hey, this is being worked on in #3372, which can probably replace this as it also contains a (so far flaky...) test.
