Skip to content

fix: backoff before retry if relay connection terminates #3254

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Apr 7, 2025

Conversation

Frando
Copy link
Member

@Frando Frando commented Apr 3, 2025

Description

Currently, if the relay server aborts a connection, we would reconnect to the relay server right away. If the relay server happens to always abort the connection right after establishing, we would be retrying in a hot loop without pause. This happens reliably if the relay server aborts the connection because it rejects the client as not authorized.

This PR has two changes:

  • The existing backoff for retrying failed dials is move up a layer, so that we also backoff after the connection closes before retrying.
  • If the server reports a Health problem, which it does for unauthenticated clients right before closing the connection, this is logged with WARN level

Breaking Changes

Notes & open questions

Change checklist

  • Self-review.
  • Documentation updates following the style guide, if relevant.
  • Tests if relevant.
  • All breaking changes documented.
    • List all breaking changes in the above "Breaking Changes" section.
    • Open an issue or PR on any number0 repos that are affected by this breaking change. Give guidance on how the updates should be handled or do the actual updates themselves. The major ones are:

Copy link

github-actions bot commented Apr 3, 2025

Documentation for this PR has been generated and is available at: https://n0-computer.github.io/iroh/pr/3254/docs/iroh/

Last updated: 2025-04-07T10:46:34Z

@Frando
Copy link
Member Author

Frando commented Apr 3, 2025

I confirmed manually that this works. To reproduce:

  • Run a relay server with a config that contains access.allowlist = []
  • Let a node connect to that relay server:
    let relay_map = RelayMap::default_from_node("http://localhost:3340".parse()?, 3478);
    let endpoint = Endpoint::builder()
        .relay_mode(RelayMode::Custom(relay_map))
        .bind()
        .await?;
  • Watch the logs.

With this PR:

2025-04-03T11:12:29.099777Z DEBUG ep{me=e288ca1203}:magicsock:relay-actor:active-relay{url=http://localhost.:3340/}:dialing: iroh::magicsock::relay_actor: Actor loop: con
necting to relay.
2025-04-03T11:12:29.100104Z DEBUG ep{me=e288ca1203}:magicsock:relay-actor:active-relay{url=http://localhost.:3340/}:dialing: iroh::_events::relay::home_changed: url=http:
//localhost.:3340/ home_relay=true
2025-04-03T11:12:29.100579Z DEBUG ep{me=e288ca1203}:magicsock:actor: iroh::_events::direct_addrs: addrs=Some({DirectAddr { addr: 127.0.0.1:46592, typ: Stun }, DirectAddr
{ addr: 172.17.0.1:46592, typ: Local }, DirectAddr { addr: 172.18.0.1:46592, typ: Local }, DirectAddr { addr: 172.19.0.1:46592, typ: Local }, DirectAddr { addr: 192.168.1
78.107:46592, typ: Local }})
2025-04-03T11:12:29.101724Z DEBUG ep{me=e288ca1203}:magicsock:relay-actor:active-relay{url=http://localhost.:3340/}:connected: iroh::magicsock::relay_actor: Actor loop: c
onnected to relay
2025-04-03T11:12:29.101748Z DEBUG ep{me=e288ca1203}:magicsock:relay-actor:active-relay{url=http://localhost.:3340/}:connected: iroh::_events::relay::connected: url=http:/
/localhost.:3340/ home_relay=true
2025-04-03T11:12:29.109673Z  WARN ep{me=e288ca1203}:magicsock:relay-actor:active-relay{url=http://localhost.:3340/}:connected: iroh::magicsock::relay_actor: Relay server
reports problem: not authenticated
2025-04-03T11:12:29.109799Z DEBUG ep{me=e288ca1203}:magicsock:relay-actor:active-relay{url=http://localhost.:3340/}: iroh::magicsock::relay_actor: Connection to relay ser
ver lost: Client stream finished
2025-04-03T11:12:29.109821Z DEBUG ep{me=e288ca1203}:magicsock:relay-actor:active-relay{url=http://localhost.:3340/}: iroh::magicsock::relay_actor: Retry in 10ms
2025-04-03T11:12:29.120649Z DEBUG ep{me=e288ca1203}:magicsock:relay-actor:active-relay{url=http://localhost.:3340/}:dialing: iroh::magicsock::relay_actor: Actor loop: con
necting to relay.
2025-04-03T11:12:29.121717Z DEBUG ep{me=e288ca1203}:magicsock:relay-actor:active-relay{url=http://localhost.:3340/}:connected: iroh::magicsock::relay_actor: Actor loop: c
onnected to relay
2025-04-03T11:12:29.121739Z DEBUG ep{me=e288ca1203}:magicsock:relay-actor:active-relay{url=http://localhost.:3340/}:connected: iroh::_events::relay::connected: url=http:/
/localhost.:3340/ home_relay=true
2025-04-03T11:12:29.127776Z  WARN ep{me=e288ca1203}:magicsock:relay-actor:active-relay{url=http://localhost.:3340/}:connected: iroh::magicsock::relay_actor: Relay server
reports problem: not authenticated
2025-04-03T11:12:29.128005Z DEBUG ep{me=e288ca1203}:magicsock:relay-actor:active-relay{url=http://localhost.:3340/}: iroh::magicsock::relay_actor: Connection to relay ser
ver lost: Client stream finished
2025-04-03T11:12:29.128046Z DEBUG ep{me=e288ca1203}:magicsock:relay-actor:active-relay{url=http://localhost.:3340/}: iroh::magicsock::relay_actor: Retry in 20ms
2025-04-03T11:12:29.149385Z DEBUG ep{me=e288ca1203}:magicsock:relay-actor:active-relay{url=http://localhost.:3340/}:dialing: iroh::magicsock::relay_actor: Actor loop: con
necting to relay.
2025-04-03T11:12:29.151065Z DEBUG ep{me=e288ca1203}:magicsock:relay-actor:active-relay{url=http://localhost.:3340/}:connected: iroh::magicsock::relay_actor: Actor loop: c
onnected to relay
2025-04-03T11:12:29.151112Z DEBUG ep{me=e288ca1203}:magicsock:relay-actor:active-relay{url=http://localhost.:3340/}:connected: iroh::_events::relay::connected: url=http:/
/localhost.:3340/ home_relay=true
2025-04-03T11:12:29.159336Z  WARN ep{me=e288ca1203}:magicsock:relay-actor:active-relay{url=http://localhost.:3340/}:connected: iroh::magicsock::relay_actor: Relay server
reports problem: not authenticated
2025-04-03T11:12:29.159484Z DEBUG ep{me=e288ca1203}:magicsock:relay-actor:active-relay{url=http://localhost.:3340/}: iroh::magicsock::relay_actor: Connection to relay ser
ver lost: Client stream finished
2025-04-03T11:12:29.159521Z DEBUG ep{me=e288ca1203}:magicsock:relay-actor:active-relay{url=http://localhost.:3340/}: iroh::magicsock::relay_actor: Retry in 39.999999ms

The backoff then goes up to 5s, and keeps retrying.

On main this would be reconnecting in a hot loop instead:

2025-04-03T11:18:30.417875Z DEBUG ep{me=9b207a55a7}:magicsock:relay-actor:active-relay{url=http://localhost.:3340/}:connected: iroh::magicsock::relay_actor: Actor loop: c
onnected to relay
2025-04-03T11:18:30.417909Z DEBUG ep{me=9b207a55a7}:magicsock:relay-actor:active-relay{url=http://localhost.:3340/}:connected: iroh::_events::relay::connected: url=http:/
/localhost.:3340/ home_relay=true
2025-04-03T11:18:30.425158Z DEBUG ep{me=9b207a55a7}:magicsock:relay-actor:active-relay{url=http://localhost.:3340/}: iroh::magicsock::relay_actor: Connection to relay ser
ver lost: Client stream finished
2025-04-03T11:18:30.425198Z DEBUG ep{me=9b207a55a7}:magicsock:relay-actor:active-relay{url=http://localhost.:3340/}:dialing: iroh::magicsock::relay_actor: Actor loop: con
necting to relay.
2025-04-03T11:18:30.426141Z DEBUG ep{me=9b207a55a7}:magicsock:relay-actor:active-relay{url=http://localhost.:3340/}:connected: iroh::magicsock::relay_actor: Actor loop: c
onnected to relay
2025-04-03T11:18:30.426163Z DEBUG ep{me=9b207a55a7}:magicsock:relay-actor:active-relay{url=http://localhost.:3340/}:connected: iroh::_events::relay::connected: url=http:/
/localhost.:3340/ home_relay=true
2025-04-03T11:18:30.432274Z DEBUG ep{me=9b207a55a7}:magicsock:relay-actor:active-relay{url=http://localhost.:3340/}: iroh::magicsock::relay_actor: Connection to relay ser
ver lost: Client stream finished
2025-04-03T11:18:30.432304Z DEBUG ep{me=9b207a55a7}:magicsock:relay-actor:active-relay{url=http://localhost.:3340/}:dialing: iroh::magicsock::relay_actor: Actor loop: con
necting to relay.
2025-04-03T11:18:30.433114Z DEBUG ep{me=9b207a55a7}:magicsock:relay-actor:active-relay{url=http://localhost.:3340/}:connected: iroh::magicsock::relay_actor: Actor loop: c
onnected to relay
2025-04-03T11:18:30.433136Z DEBUG ep{me=9b207a55a7}:magicsock:relay-actor:active-relay{url=http://localhost.:3340/}:connected: iroh::_events::relay::connected: url=http:/
/localhost.:3340/ home_relay=true
2025-04-03T11:18:30.439259Z DEBUG ep{me=9b207a55a7}:magicsock:relay-actor:active-relay{url=http://localhost.:3340/}: iroh::magicsock::relay_actor: Connection to relay ser
ver lost: Client stream finished
2025-04-03T11:18:30.439288Z DEBUG ep{me=9b207a55a7}:magicsock:relay-actor:active-relay{url=http://localhost.:3340/}:dialing: iroh::magicsock::relay_actor: Actor loop: con
necting to relay.
2025-04-03T11:18:30.439883Z DEBUG ep{me=9b207a55a7}:magicsock:relay-actor:active-relay{url=http://localhost.:3340/}:connected: iroh::magicsock::relay_actor: Actor loop: c
onnected to relay

@Frando Frando changed the title fix: use backoff once relay connection fails after successful dial fix: backoff before retry if relay connection termiantes Apr 3, 2025
@Frando Frando changed the title fix: backoff before retry if relay connection termiantes fix: backoff before retry if relay connection terminates Apr 3, 2025
Copy link

github-actions bot commented Apr 3, 2025

Netsim report & logs for this PR have been generated and is available at: LOGS
This report will remain available for 3 days.

Last updated for commit: 58c8e3e

@n0bot n0bot bot added this to iroh Apr 3, 2025
@github-project-automation github-project-automation bot moved this to 🏗 In progress in iroh Apr 3, 2025
@matheus23 matheus23 added this to the 0.35.0 milestone Apr 4, 2025
@Frando Frando force-pushed the fix/relay-backoff branch from cf10a7a to a4e3c89 Compare April 4, 2025 07:58
@Frando Frando force-pushed the fix/relay-backoff branch from ebca398 to 82b14dc Compare April 4, 2025 08:14

while let Err(err) = self.run_once().await {
warn!("Relay connection failed: {err:#}");
if self.pong_received {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My instinct would have been to put this in the error code. Something like:

enum RelayServerError {
    Connecting(...),
    Handshake(...),
    Established(...),
}

Then Connecting and Handshake would result in the next backoff value to be used. Established would trigger resetting the backoff timer.

My reason for this is that while this is a bit more complex on the error handling, it is less state on the struct. And that is usually more important I think, because the state on the struct can leak much easier to un-intended places and you needs to keep it in mind much more. To me it increases the cognitive load and complexity much more.

I'm not going to argue too hard for this though. I haven't actually tried to write this code so maybe my suggestion ends up worse anyway.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the suggestion and went ahead and implemented this. Turned out quite OK.

@Frando Frando force-pushed the fix/relay-backoff branch from 44b7968 to 419d93c Compare April 4, 2025 10:04
Copy link
Contributor

@flub flub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Thanks for trying out my suggestions.

I consider with_jitter a blocking thing. But am sure that you can add that without needing an extra review :)


fn sleep(&self, dur: Duration) -> Self::Sleep {
time::sleep(dur)
// This is using `impl Future` to return a future without a reference to self.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, that's an interesting trick. Must try to remember this.

.with_min_delay(Duration::from_millis(10))
.with_max_delay(Duration::from_secs(16))
.without_max_times()
.build()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed that the backon crate does not enable jitter by default. That's bad. We should add with_jitter

@Frando Frando added this pull request to the merge queue Apr 7, 2025
Merged via the queue into main with commit bc6e98c Apr 7, 2025
28 of 29 checks passed
@github-project-automation github-project-automation bot moved this from 🏗 In progress to ✅ Done in iroh Apr 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: ✅ Done
Development

Successfully merging this pull request may close these issues.

3 participants