Add a timeout to inbound kad substreams #2
What about an upstream PR?
Happy to do that, if you think it will get a better response than the ticket?
Upstreamed as libp2p#6009
return Poll::Pending;
}
Poll::Ready(Err(_)) => return Poll::Ready(None),
// TODO: close here? (x2)
It makes sense to close the current request here and reuse the substream for the next one.
The upstream code already re-uses the substream slot here. If we are at the substream limit, the old substream ID is dropped, and a new one is used. This PR just makes more substreams available for re-use, by adding a timeout to the first request.
This TODO is about whether we need to explicitly call the close() method on the substream. The old code didn't, and it isn't needed to fix the bug, so I'd like to wait for a review from upstream before making this change.
It might be that close() does nothing, or nothing important. It's possible substreams don't have any resources, they are just an ID in a vector. And that's dropped when the limit is reached anyway.
/// How long before we give up on waiting for the first request on this stream.
/// `None` if there has already been a request on this stream, or it has timed out and can
/// be re-used.
first_request_timeout: Option<Delay>,
Why only the first request?
Maybe I'm missing context here, but ideally we would time out any pending request and move on to the next one, no?
Timeouts are redundant on any requests after the first request on a substream. See commit 8053eb9, which removes the redundant timeout on the second and later requests.
Background
In the upstream code, substreams are already available for re-use as soon as the first request has finished. But there's no timeout upstream, so if the first request never arrives, the substream can never be re-used.
When the remote peer opens a substream, we know it will send at least one request, so this fix makes the inbound side wait 10 seconds for that first request. After the timeout, the substream can be re-used immediately. This matches the timeout and re-use behaviour on the outbound side.
Peer Behaviour
After the first request, there are two possible behaviours in the protocol that we need to handle. The peer can send another request on the same substream ID, or it can use a new substream ID.
To handle the case where the peer re-uses the same substream ID, we leave the substream ID slot available, unless we reach the substream limit.
To handle the case where the peer sends a new substream ID, we accept new substreams up to a limit (32).
When that limit is reached, we drop any timed out or used substream IDs immediately, and re-use that substream slot with the new ID.
This is why there is no timeout after the first request - we don't need one, because the substream can be re-used immediately. We'd just be adding timers and load for nothing.
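As a rough illustration of the slot policy described above, here is a minimal sketch. It is not the handler code from this PR: `Slot`, `SlotTable`, and the constant names are hypothetical, the limit (32) and timeout (10 seconds) are taken from this comment, and a plain `Instant` deadline stands in for the async `Delay` timer.

```rust
use std::time::{Duration, Instant};

/// Assumed limit, taken from the "(32)" mentioned in the comment above.
const MAX_SUBSTREAMS: usize = 32;
/// Assumed timeout, taken from the "10 seconds" mentioned in the comment above.
const FIRST_REQUEST_TIMEOUT: Duration = Duration::from_secs(10);

/// Hypothetical, simplified stand-in for one inbound substream slot.
struct Slot {
    id: u64,
    /// `Some(deadline)` while waiting for the first request,
    /// `None` once a request has finished on this substream.
    first_request_deadline: Option<Instant>,
}

impl Slot {
    /// A slot is re-usable if it has already served a request,
    /// or if its first request never arrived before the deadline.
    fn is_reusable(&self, now: Instant) -> bool {
        match self.first_request_deadline {
            None => true,
            Some(deadline) => now >= deadline,
        }
    }
}

/// Hypothetical table of inbound substream slots.
#[derive(Default)]
struct SlotTable {
    slots: Vec<Slot>,
}

impl SlotTable {
    /// Accept a new substream ID. Below the limit it is always accepted.
    /// At the limit, one re-usable slot is dropped to make room; if no
    /// slot is re-usable, the new substream is refused.
    fn accept(&mut self, id: u64, now: Instant) -> bool {
        if self.slots.len() >= MAX_SUBSTREAMS {
            match self.slots.iter().position(|s| s.is_reusable(now)) {
                Some(index) => {
                    self.slots.swap_remove(index);
                }
                None => return false,
            }
        }
        self.slots.push(Slot {
            id,
            first_request_deadline: Some(now + FIRST_REQUEST_TIMEOUT),
        });
        true
    }

    /// Record that a request finished on `id`; no further timer is needed.
    fn request_finished(&mut self, id: u64) {
        if let Some(slot) = self.slots.iter_mut().find(|s| s.id == id) {
            slot.first_request_deadline = None;
        }
    }

    /// Whether a slot for `id` is currently held.
    fn contains(&self, id: u64) -> bool {
        self.slots.iter().any(|s| s.id == id)
    }
}
```

In this model a slot that has served a request needs no timer at all: it stays in the table in case the peer re-uses it, and is simply dropped when a new substream needs the space.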
I'd like to wait for upstream before making any further changes beyond a timeout for the first request.
We know this code works and fixes the bug, and I don't want to accidentally break anything.
Description
This PR adds an inbound substream timeout to the kad protocol, which matches the outbound substream timeout. This prevents "substream limit exceeded" errors under load, which happen when the outbound side times out but the inbound side keeps waiting. This is a particular problem in the waiting for first request, waiting for behaviour response, pending send, pending flush, and closing states, because those substreams can't be re-used.
Fixes libp2p#3450
Upstream libp2p#5981
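The inbound timeout described above can be sketched as a poll function that ends the substream once the deadline for the first request has passed. This is a simplified model under stated assumptions, not the code in this PR: `InboundSubstream` and `poll_first_request` are hypothetical names, requests are plain strings, and an `Instant` deadline passed in by the caller replaces the async `Delay` used by the `first_request_timeout` field.

```rust
use std::task::Poll;
use std::time::{Duration, Instant};

/// Hypothetical, simplified inbound substream state.
struct InboundSubstream {
    /// `None` once a request has arrived or the wait has timed out.
    first_request_deadline: Option<Instant>,
}

impl InboundSubstream {
    /// Start waiting for the first request, with the given timeout.
    fn new(now: Instant, timeout: Duration) -> Self {
        Self {
            first_request_deadline: Some(now + timeout),
        }
    }

    /// Poll for a request. `incoming` is `Some(..)` when a request is ready.
    /// Returns `Ready(Some(request))` when a request arrives,
    /// `Ready(None)` when the first-request deadline has passed,
    /// and `Pending` otherwise.
    fn poll_first_request(
        &mut self,
        now: Instant,
        incoming: Option<&str>,
    ) -> Poll<Option<String>> {
        if let Some(request) = incoming {
            // A request arrived: the first-request timer is no longer needed.
            self.first_request_deadline = None;
            return Poll::Ready(Some(request.to_owned()));
        }
        match self.first_request_deadline {
            Some(deadline) if now >= deadline => {
                // Timed out: clear the timer and end this substream.
                self.first_request_deadline = None;
                Poll::Ready(None)
            }
            // Still within the deadline, or no timer is active.
            _ => Poll::Pending,
        }
    }
}
```

Passing `now` in explicitly keeps the sketch deterministic; the real handler would instead poll its timer as part of the substream's own `poll` implementation.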
Notes & open questions
Should the substream be closed on a timeout?
Upstream doesn't close them on (most) substream errors, so this PR handles timeouts the same way.
Change checklist
A changelog entry has been made in the appropriate crates