
Add a timeout to inbound kad substreams #2


Merged
3 commits merged into subspace-v9 on Apr 28, 2025

Conversation

teor2345 (Member)

Description

This PR adds an inbound substream timeout to the kad protocol, matching the existing outbound substream timeout. This prevents "substream limit exceeded" errors under load, which were caused by the outbound side timing out while the inbound side kept waiting.

This is a particular problem in the waiting-for-first-request, waiting-for-behaviour-response, pending-send, pending-flush, and closing states, because substreams in those states can't be re-used.

Fixes libp2p#3450
Upstream libp2p#5981
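
As a rough sketch of the approach (not the PR's actual diff): the inbound substream state gains an optional timer that is polled alongside the substream, mirroring the outbound side. `Delay` is futures-timer's timer type, which rust-libp2p uses elsewhere; `InboundState`, `FIRST_REQUEST_TIMEOUT`, and `poll_first_request_timeout` are illustrative names, and the 10-second value comes from the discussion below.

```rust
use std::task::Context;
use std::time::Duration;

use futures::FutureExt;
use futures_timer::Delay;

/// Illustrative constant; the PR matches the outbound substream timeout.
const FIRST_REQUEST_TIMEOUT: Duration = Duration::from_secs(10);

/// Sketch of the inbound substream state with the new timeout field.
struct InboundState {
    /// Armed when the substream opens; cleared once the first request
    /// arrives or the timeout fires.
    first_request_timeout: Option<Delay>,
}

impl InboundState {
    fn new() -> Self {
        Self {
            first_request_timeout: Some(Delay::new(FIRST_REQUEST_TIMEOUT)),
        }
    }

    /// Polled alongside the substream; returns `true` once waiting for the
    /// first request has timed out, meaning the slot can be re-used.
    fn poll_first_request_timeout(&mut self, cx: &mut Context<'_>) -> bool {
        match self.first_request_timeout.as_mut() {
            Some(delay) if delay.poll_unpin(cx).is_ready() => {
                self.first_request_timeout = None;
                true
            }
            _ => false,
        }
    }
}
```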

Notes & open questions

Should the substream be closed on a timeout?
Upstream doesn't close them on (most) substream errors, so this PR handles timeouts the same way.

Change checklist

  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • A changelog entry has been made in the appropriate crates

@nazar-pc (Member)

What about upstream PR?

@teor2345 (Member, Author)

> What about upstream PR?

Happy to do that, if you think it will get a better response than the ticket?

@teor2345 (Member, Author)

Upstreamed as libp2p#6009

return Poll::Pending;
}
Poll::Ready(Err(_)) => return Poll::Ready(None),
// TODO: close here? (x2)


It makes sense to close the current request here and re-use the substream for the next one.

@teor2345 (Member, Author)


The upstream code already re-uses the substream slot here. If we are at the substream limit, the old substream ID is dropped, and a new one is used. This PR just makes more substreams available for re-use, by adding a timeout to the first request.

This TODO is about whether we need to explicitly call the close() method on the substream. The old code didn't, and it isn't needed to fix the bug, so I'd like to wait for a review from upstream before making this change.

It might be that close() does nothing, or nothing important. It's possible substreams don't hold any resources and are just an ID in a vector, which is dropped when the limit is reached anyway.
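
For illustration only: if an explicit close were added later, it would presumably drive `Sink::poll_close` on the framed substream before freeing the slot. This is a hypothetical helper, not code from the PR or from upstream; `S` stands in for the framed Kademlia substream type, which implements `Sink` for response messages.

```rust
use std::pin::Pin;
use std::task::{Context, Poll};

use futures::Sink;

/// Hypothetical helper: drive an explicit close before releasing the slot.
fn poll_close_substream<S, T>(substream: &mut S, cx: &mut Context<'_>) -> Poll<()>
where
    S: Sink<T> + Unpin,
{
    // Errors are ignored, matching how upstream treats most substream
    // errors: the slot is freed either way.
    Sink::poll_close(Pin::new(substream), cx).map(|_| ())
}
```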

/// How long before we give up on waiting for the first request on this stream.
/// `None` if there has already been a request on this stream, or it has timed out and can
/// be re-used.
first_request_timeout: Option<Delay>,


Why only the first request? Maybe I'm missing context here, but ideally we'd time out on any pending request and move on to the next one, no?

@teor2345 (Member, Author)


Timeouts are redundant on any requests after the first request on a substream. See commit 8053eb9, which removes the redundant timeout on the second and later requests.

Background

In the upstream code, substreams are already available for re-use as soon as the first request has finished. But there's no timeout upstream, so if the first request never arrives, the substream can never be re-used.

When the remote peer opens a substream, we know it will send at least one request, so this fix makes the inbound side wait 10 seconds for that first request. After the timeout, the substream can be re-used immediately. This matches the timeout and re-use behaviour on the outbound side.

Peer Behaviour

After the first request, there are two possible behaviours in the protocol that we need to handle. The peer can send another request on the same substream ID, or it can use a new substream ID.

To handle the case where the peer re-uses the same substream ID, we leave the substream ID slot available, unless we reach the substream limit.

To handle the case where the peer opens a new substream ID, we accept new substreams up to a limit (32). When that limit is reached, we immediately drop any timed-out or used substream IDs and re-use that substream slot for the new ID.

This is why there is no timeout after the first request: we don't need one, because the substream can be re-used immediately. We'd just be adding timers and load for nothing. A sketch of this slot re-use follows.
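
A minimal sketch of the slot re-use described above, with illustrative names and states (the real handler has more states, including pending send, pending flush, and closing, and different types):

```rust
/// Illustrative limit from the discussion above.
const MAX_INBOUND_SUBSTREAMS: usize = 32;

/// Illustrative slot states.
enum SubstreamSlot {
    /// Waiting for the first request; may time out and become `Reusable`.
    WaitingFirstRequest,
    /// Timed out or finished its request; can be replaced immediately.
    Reusable,
}

/// Accept a new inbound substream, re-using a dead slot once at the limit.
fn accept_new_substream(slots: &mut Vec<SubstreamSlot>) -> Option<usize> {
    if slots.len() < MAX_INBOUND_SUBSTREAMS {
        slots.push(SubstreamSlot::WaitingFirstRequest);
        return Some(slots.len() - 1);
    }
    // At the limit: drop the first timed-out or used slot and take its index.
    // Without the first-request timeout, a slot stuck waiting for a request
    // that never arrives would never become `Reusable`.
    let idx = slots
        .iter()
        .position(|slot| matches!(slot, SubstreamSlot::Reusable))?;
    slots[idx] = SubstreamSlot::WaitingFirstRequest;
    Some(idx)
}
```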

@teor2345 (Member, Author) left a comment


I'd like to wait for upstream before making any further changes beyond a timeout for the first request.

We know this code works and fixes the bug, and I don't want to accidentally break anything.

@teor2345 merged commit 399c4c7 into subspace-v9 on Apr 28, 2025
11 of 69 checks passed
Labels: bug