
Add a timeout to inbound kad substreams #2


Merged
3 commits merged into subspace-v9 on Apr 28, 2025

Conversation

teor2345 (Member)

Description

This PR adds an inbound substream timeout to the kad protocol, matching the existing outbound substream timeout. This prevents "substream limit exceeded" errors under load, which were caused by the outbound side timing out while the inbound side kept waiting.

This is a particular problem in the waiting-for-first-request, waiting-for-behaviour-response, pending-send, pending-flush, and closing states, because substreams in those states can't be re-used.

Fixes libp2p#3450
Upstream libp2p#5981
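
As a rough sketch of the approach (not the PR's actual diff): the inbound substream state gains an optional timer that is polled alongside the substream, mirroring the outbound side. `Delay` is futures-timer's timer type, which rust-libp2p uses elsewhere; `InboundState`, `FIRST_REQUEST_TIMEOUT`, and `poll_first_request_timeout` are illustrative names, and the 10-second value comes from the discussion below.

```rust
use std::task::Context;
use std::time::Duration;

use futures::FutureExt;
use futures_timer::Delay;

/// Illustrative constant; the PR matches the outbound substream timeout.
const FIRST_REQUEST_TIMEOUT: Duration = Duration::from_secs(10);

/// Sketch of the inbound substream state with the new timeout field.
struct InboundState {
    /// Armed when the substream opens; cleared once the first request
    /// arrives or the timeout fires.
    first_request_timeout: Option<Delay>,
}

impl InboundState {
    fn new() -> Self {
        Self {
            first_request_timeout: Some(Delay::new(FIRST_REQUEST_TIMEOUT)),
        }
    }

    /// Polled alongside the substream; returns `true` once waiting for the
    /// first request has timed out, meaning the slot can be re-used.
    fn poll_first_request_timeout(&mut self, cx: &mut Context<'_>) -> bool {
        match self.first_request_timeout.as_mut() {
            Some(delay) if delay.poll_unpin(cx).is_ready() => {
                self.first_request_timeout = None;
                true
            }
            _ => false,
        }
    }
}
```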

Notes & open questions

Should the substream be closed on a timeout?
Upstream doesn't close them on (most) substream errors, so this PR handles timeouts the same way.

Change checklist

  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • A changelog entry has been made in the appropriate crates

@nazar-pc (Member)

What about upstream PR?

@teor2345 (Member, Author)

> What about upstream PR?

Happy to do that, if you think it will get a better response than the ticket?

@teor2345 (Member, Author)

Upstreamed as libp2p#6009

return Poll::Pending;
}
Poll::Ready(Err(_)) => return Poll::Ready(None),
// TODO: close here? (x2)


It makes sense to close the current request here and re-use the substream for the next one.

@teor2345 (Member, Author)


The upstream code already re-uses the substream slot here. If we are at the substream limit, the old substream ID is dropped, and a new one is used. This PR just makes more substreams available for re-use, by adding a timeout to the first request.

This TODO is about whether we need to explicitly call the close() method on the substream. The old code didn't, and it isn't needed to fix the bug, so I'd like to wait for a review from upstream before making this change.

It might be that close() does nothing, or nothing important. It's possible substreams don't hold any resources and are just an ID in a vector, which is dropped when the limit is reached anyway.
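
For illustration only: if an explicit close were added later, it would presumably drive `Sink::poll_close` on the framed substream before freeing the slot. This is a hypothetical helper, not code from the PR or from upstream; `S` stands in for the framed Kademlia substream type, which implements `Sink` for response messages.

```rust
use std::pin::Pin;
use std::task::{Context, Poll};

use futures::Sink;

/// Hypothetical helper: drive an explicit close before releasing the slot.
fn poll_close_substream<S, T>(substream: &mut S, cx: &mut Context<'_>) -> Poll<()>
where
    S: Sink<T> + Unpin,
{
    // Errors are ignored, matching how upstream treats most substream
    // errors: the slot is freed either way.
    Sink::poll_close(Pin::new(substream), cx).map(|_| ())
}
```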

/// How long before we give up on waiting for the first request on this stream.
/// `None` if there has already been a request on this stream, or it has timed out and can
/// be re-used.
first_request_timeout: Option<Delay>,


Why only the first request? Maybe I'm missing context here, but ideally we'd time out on any pending request and move on to the next one, no?

@teor2345 (Member, Author)


Timeouts are redundant on any requests after the first request on a substream. See commit 8053eb9, which removes the redundant timeout on the second and later requests.

Background

In the upstream code, substreams are already available for re-use as soon as the first request has finished. But there's no timeout upstream, so if the first request never arrives, the substream can never be re-used.

When the remote peer opens a substream, we know it will send at least one request, so this fix makes the inbound side wait 10 seconds for that first request. After the timeout, the substream can be re-used immediately. This matches the timeout and re-use behaviour on the outbound side.

Peer Behaviour

After the first request, there are two possible behaviours in the protocol that we need to handle. The peer can send another request on the same substream ID, or it can use a new substream ID.

To handle the case where the peer re-uses the same substream ID, we leave the substream ID slot available, unless we reach the substream limit.

To handle the case where the peer opens a new substream ID, we accept new substreams up to a limit (32). When that limit is reached, we immediately drop any timed-out or used substream IDs and re-use that substream slot for the new ID.

This is why there is no timeout after the first request: we don't need one, because the substream can be re-used immediately. We'd just be adding timers and load for nothing. A sketch of this slot re-use follows.
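
A minimal sketch of the slot re-use described above, with illustrative names and states (the real handler has more states, including pending send, pending flush, and closing, and different types):

```rust
/// Illustrative limit from the discussion above.
const MAX_INBOUND_SUBSTREAMS: usize = 32;

/// Illustrative slot states.
enum SubstreamSlot {
    /// Waiting for the first request; may time out and become `Reusable`.
    WaitingFirstRequest,
    /// Timed out or finished its request; can be replaced immediately.
    Reusable,
}

/// Accept a new inbound substream, re-using a dead slot once at the limit.
fn accept_new_substream(slots: &mut Vec<SubstreamSlot>) -> Option<usize> {
    if slots.len() < MAX_INBOUND_SUBSTREAMS {
        slots.push(SubstreamSlot::WaitingFirstRequest);
        return Some(slots.len() - 1);
    }
    // At the limit: drop the first timed-out or used slot and take its index.
    // Without the first-request timeout, a slot stuck waiting for a request
    // that never arrives would never become `Reusable`.
    let idx = slots
        .iter()
        .position(|slot| matches!(slot, SubstreamSlot::Reusable))?;
    slots[idx] = SubstreamSlot::WaitingFirstRequest;
    Some(idx)
}
```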

@teor2345 (Member, Author) left a comment


I'd like to wait for upstream before making any further changes beyond a timeout for the first request.

We know this code works and fixes the bug, and I don't want to accidentally break anything.

@teor2345 merged commit 399c4c7 into subspace-v9 on Apr 28, 2025
11 of 69 checks passed
Labels: bug