-
Notifications
You must be signed in to change notification settings - Fork 236
fix: backoff before retry if relay connection terminates #3254
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Documentation for this PR has been generated and is available at: https://n0-computer.github.io/iroh/pr/3254/docs/iroh/ Last updated: 2025-04-07T10:46:34Z |
I confirmed manually that this works. To reproduce:
let relay_map = RelayMap::default_from_node("http://localhost:3340".parse()?, 3478);
let endpoint = Endpoint::builder()
.relay_mode(RelayMode::Custom(relay_map))
.bind()
.await?;
With this PR:
The backoff then goes up to 5s, and keeps retrying. On
|
cf10a7a
to
a4e3c89
Compare
ebca398
to
82b14dc
Compare
iroh/src/magicsock/relay_actor.rs
Outdated
|
||
while let Err(err) = self.run_once().await { | ||
warn!("Relay connection failed: {err:#}"); | ||
if self.pong_received { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My instinct would have been to put this in the error code. Something like:
enum RelayServerError {
Connecting(...),
Handshake(...),
Established(...),
}
Then Connecting
and Handshake
would result in the next backoff value to be used. Established
would trigger resetting the backoff timer.
My reason for this is that while this is a bit more complex on the error handling, it is less state on the struct. And that is usually more important I think, because the state on the struct can leak much easier to un-intended places and you needs to keep it in mind much more. To me it increases the cognitive load and complexity much more.
I'm not going to argue too hard for this though. I haven't actually tried to write this code so maybe my suggestion ends up worse anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the suggestion and went ahead and implemented this. Turned out quite OK.
44b7968
to
419d93c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Thanks for trying out my suggestions.
I consider with_jitter
a blocking thing. But am sure that you can add that without needing an extra review :)
|
||
fn sleep(&self, dur: Duration) -> Self::Sleep { | ||
time::sleep(dur) | ||
// This is using `impl Future` to return a future without a reference to self. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huh, that's an interesting trick. Must try to remember this.
.with_min_delay(Duration::from_millis(10)) | ||
.with_max_delay(Duration::from_secs(16)) | ||
.without_max_times() | ||
.build() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed that the backon crate does not enable jitter by default. That's bad. We should add with_jitter
Description
Currently, if the relay server aborts a connection, we would reconnect to the relay server right away. If the relay server happens to always abort the connection right after establishing, we would be retrying in a hot loop without pause. This happens reliably if the relay server aborts the connection because it rejects the client as not authorized.
This PR has two changes:
Health
problem, which it does for unauthenticated clients right before closing the connection, this is logged withWARN
levelBreaking Changes
Notes & open questions
Change checklist
quic-rpc
iroh-gossip
iroh-blobs
dumbpipe
sendme