
Bug: rendezvous server panics in response to libp2p_request_response::Event::Message #5997


Open · scglenn opened this issue Apr 19, 2025 · 8 comments


scglenn commented Apr 19, 2025

Summary

We are currently running a rendezvous server that helps nodes bootstrap into a set of peers under a topic.

The rendezvous server handled up to about 1,000 connections with no issue, but we have now seen it panic from within the libp2p-rendezvous crate.

The number of connections has increased to about 3k-5k, and this issue started to show up. We removed the pending-connection and established-connection limits and then started to see this behavior.

thread 'tokio-runtime-worker' panicked at /home/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/libp2p-rendezvous-0.14.0/src/server.rs:184:38:

This is where the panic occurs:

.expect("Send response");


I'm unsure if the issue is related to the number of connections, but that's the only thing that has changed.

The comments suggest that self.inner.send_response will bubble up an error:
> If the [ResponseChannel] is already closed due to a timeout or the connection being closed, the response is returned as an Err for further handling.
https://github.com/libp2p/rust-libp2p/blob/master/protocols/request-response/src/lib.rs#L484


Questions:

  • Is it correct behavior for self.inner.send_response to panic if a connection timed out or was closed?
  • How are we supposed to handle this?
  • Any ideas as to why this could be happening?

Additional Note:

  • I would expect the system to be able to handle many more connections than this, given that I have seen others share data showing nodes handling 10k-40k connections.

Expected behavior

I would expect the response to fail for that specific connection and the rendezvous server to continue handling all the other connections rather than crashing.

Actual behavior

It panics frequently, after a couple of minutes of handling connections.

Relevant log output

`thread 'tokio-runtime-worker' panicked at /home/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/libp2p-rendezvous-0.14.0/src/server.rs:184:38:`

Possible Solution

Perhaps the error should just be handled in a way that doesn't panic the entire swarm?
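For illustration, here is a rough sketch of the kind of non-panicking handling I have in mind, written against the public request-response API rather than the actual server.rs internals (the function name, the trait bounds, and the use of the tracing crate for logging are my own placeholders, not libp2p's code):

```rust
use libp2p::request_response::{self, Codec, ResponseChannel};

/// Sketch only: reply to an inbound request and treat a closed
/// ResponseChannel as non-fatal. As the docs quoted above say, `send_response`
/// hands the response back as `Err` when the channel was already closed by a
/// timeout or a dropped connection, so we log it and keep the swarm running.
fn answer_request<C>(
    behaviour: &mut request_response::Behaviour<C>,
    channel: ResponseChannel<C::Response>,
    response: C::Response,
) where
    C: Codec + Clone + Send + 'static,
    C::Response: std::fmt::Debug,
{
    if let Err(unsent) = behaviour.send_response(channel, response) {
        // The peer is already gone; drop the response instead of crashing.
        tracing::warn!(?unsent, "response channel closed before we could reply");
    }
}
```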

Version

libp2p = "0.53.2"

Would you like to work on fixing this bug?

Maybe


scglenn commented Apr 19, 2025

Additionally, I saw an interesting note from a previous issue:
"However, it is true that the peer could close the connection before it has a chance to respond to the request, though in this case it happened after the request is received while polling the request-response behaviour."

#5570 (comment)

@dariusc93 Perhaps this is similar?

dariusc93 (Member) commented:

> Additionally, I saw an interesting note from a previous issue: "However, it is true that the peer could close the connection before it has a chance to respond to the request, though in this case it happened after the request is received while polling the request-response behaviour."
>
> #5570 (comment)
>
> @dariusc93 Perhaps this is similar?

It's very possible. Could you provide any logs and your swarm configuration?


scglenn commented Apr 20, 2025

p2psetup.txt

Perhaps this helps @dariusc93

The logs I have at this moment don't show much more insight beyond the panic itself; I can enable debug and see if there is anything notable.

Note that I did test a deployment where I set the connection limits to the default, which I believe is effectively a max u32, so I don't think that is causing the issue.

Additionally, does it make sense for a panic to occur here? I would expect connections to potentially expire or close at any time, but I wouldn't expect that to crash the swarm.
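For reference, a minimal sketch of how explicit limits are expressed with the connection-limits behaviour (the numbers are made up rather than taken from p2psetup.txt, and it assumes the libp2p connection-limits feature is enabled):

```rust
use libp2p::connection_limits::{self, ConnectionLimits};

fn main() {
    // ConnectionLimits::default() leaves every limit unset, so nothing is
    // capped unless a limit is opted into explicitly. The numbers below are
    // illustrative only.
    let limits = ConnectionLimits::default()
        .with_max_pending_incoming(Some(256))
        .with_max_established_incoming(Some(5_000))
        .with_max_established(Some(10_000));

    // The resulting behaviour then goes into the swarm's NetworkBehaviour struct.
    let _limits_behaviour = connection_limits::Behaviour::new(limits);
}
```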

dariusc93 (Member) commented:

> p2psetup.txt
>
> Perhaps this helps @dariusc93
>
> The logs I have at this moment don't show much more insight beyond the panic itself; I can enable debug and see if there is anything notable.
>
> Note that I did test a deployment where I set the connection limits to the default, which I believe is effectively a max u32, so I don't think that is causing the issue.

Thanks. Though it should not be the issue, you might want to move the connection-limits behaviour to the top so it is handled first when the behaviours are polled by the swarm; polling it later in the tree can cause an inconsistent state if the connection is later denied while other behaviours have already accepted it. See #4870. Besides that, I don't see an exact cause in the code you provided, so it may just be some inconsistency in the state of the behaviours.
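Something along these lines, with the connection-limits behaviour as the first field (the field names and the extra behaviours are placeholders, not your actual setup from p2psetup.txt):

```rust
use libp2p::{connection_limits, identify, ping, rendezvous, swarm::NetworkBehaviour};

// The derived NetworkBehaviour delegates to its fields in declaration order,
// so the limits behaviour gets a chance to deny a connection before the other
// behaviours have accepted it. Assumes the libp2p "macros" feature for the derive.
#[derive(NetworkBehaviour)]
struct ServerBehaviour {
    limits: connection_limits::Behaviour,
    rendezvous: rendezvous::server::Behaviour,
    identify: identify::Behaviour,
    ping: ping::Behaviour,
}
```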


scglenn commented Apr 20, 2025

> Thanks. Though it should not be the issue, you might want to move the connection-limits behaviour to the top so it is handled first when the behaviours are polled by the swarm; polling it later in the tree can cause an inconsistent state if the connection is later denied while other behaviours have already accepted it. See #4870. Besides that, I don't see an exact cause in the code you provided, so it may just be some inconsistency in the state of the behaviours.

Perhaps I should move the rendezvous server behaviour to the top, with the connection-limits behaviour above it, so that rendezvous server events get handled first?


scglenn commented Apr 21, 2025

Moving the connection-limits and rendezvous server behaviours to the top has not fixed the panic so far.

dariusc93 (Member) commented:

Could you provide a log with RUST_LOG=debug used for when it panics?
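Note that RUST_LOG only takes effect if the binary installs an env-filter-aware subscriber; a minimal sketch, assuming the tracing-subscriber crate with its env-filter feature:

```rust
use tracing_subscriber::EnvFilter;

fn main() {
    // Honour RUST_LOG (e.g. RUST_LOG=debug, or something narrower such as
    // RUST_LOG=libp2p_rendezvous=debug,libp2p_swarm=debug) before starting
    // the swarm, so the debug output actually gets emitted.
    tracing_subscriber::fmt()
        .with_env_filter(EnvFilter::from_default_env())
        .init();

    // ... build and run the rendezvous server as before ...
}
```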


scglenn commented Apr 21, 2025

> Could you provide a log with RUST_LOG=debug used for when it panics?

From the panic before we forked libp2p

Apr 17 00:48:22 ip-10-0-1-103 node[4634]: thread 'tokio-runtime-worker' panicked at /home/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/libp2p-rendezvous-0.14.0/src/server.rs:184:38:
Apr 17 00:48:22 ip-10-0-1-103 node[4634]: Send response: DiscoverResponse(Ok(([Registration { namespace: Namespace(<redacted>), record: PeerRecord { peer_id: PeerId(<redacted>), seq: 1744850900, addresses: [<redacted>], envelope: SignedEnvelope { key: PublicKey { publickey: Ed25519(PublicKey(compressed): e153a5b4ffd298f7c934dbc11c91e5b6e97e1ad6984117a6d147d4a644a7b7a1) }, payload_type: [47, 108, 105, 98, 112, 50, 112, 47, 114, 111, 117, 116, 105, 110, 103, 45, 115, 116, 97, 116, 101, 45, 114, 101, 99, 111, 114, 100], payload: [10, 38, 0, 36, 8, 1, 18, 32, 225, 83, 165, 180, 255, 210, 152, 247, 201, 52, 219, 193, 28, 145, 229, 182, 233, 126, 26, 214, 152, 65, 23, 166, 209, 71, 212, 166, 68, 167, 183, 161, 16, 212, 159, 129, 192, 6, 26, 10, 10, 8, 4, 62, 169, 24, 179, 6, 122, 81], signature: [168, 181, 223, 55, 194, 86, 7, 237, 139, 113, 177, 178, 114, 139, 221, 17, 56, 243, 111, 91, 190, 249, 44, 38, 128, 146, 6, 61, 50, 236, 64, 221, 174, 142, 131, 40, 183, 86, 182, 102, 248, 208, 93, 21, 137, 51, 50, 77, 254, 110, 129, 105, 252, 42, 36, 77, 174, 241, 98, 215, 114, 56, 15, 13] } }, ttl: 7200 }], Cookie { id: 4808417751412047776, namespace: Some(Namespace(<redacted>)) })))

We updated the rendezvous logic to not panic if a connection dropped, and now we see this (we added our own log that states "Failed to send response" rather than panicking):

$ sudo journalctl -u service --since "2025-04-21 20:09:00" -f | grep -i "failed to send response"
Apr 21 20:30:05 ip-10-0-1-103 drosera-seed-node[42328]: 2025-04-21T20:30:05.712449Z  WARN Swarm::poll: libp2p_rendezvous::server: Failed to send response: DiscoverResponse(Ok(([Registration { namespace: Namespace("<redacted>"), record: PeerRecord { peer_id: PeerId("<redacted>"), seq: 1745267404, addresses: [<redacted>], envelope: SignedEnvelope { key: PublicKey { publickey: Ed25519(PublicKey(compressed): 83ed6a1568123621ae8d21d84a9f63c23fb31872a47b53660ef8c8682f8ed31) }, payload_type: [47, 108, 105, 98, 112, 50, 112, 47, 114, 111, 117, 116, 105, 110, 103, 45, 115, 116, 97, 116, 101, 45, 114, 101, 99, 111, 114, 100], payload: [10, 38, 0, 36, 8, 1, 18, 32, 131, 237, 106, 21, 104, 18, 54, 33, 174, 141, 33, 216, 74, 159, 99, 194, 63, 179, 24, 7, 42, 71, 181, 54, 96, 239, 140, 134, 130, 248, 237, 49, 16, 204, 213, 154, 192, 6, 26, 10, 10, 8, 4, 167, 86, 80, 63, 6, 122, 81], signature: [179, 153, 53, 19, 123, 37, 166, 35, 71, 14, 5, 160, 71, 9, 163, 78, 109, 152, 252, 79, 9, 82, 162, 7, 24, 52, 21, 242, 51, 55, 6, 127, 62, 254, 152, 205, 97, 170, 69, 28, 16, 37, 55, 34, 98, 76, 73, 107, 91, 240, 32, 245, 148, 55, 251, 229, 231, 241, 94, 0, 122, 249, 186, 3] } }, ttl: 7200 }], Cookie { id: 8618633838550913715, namespace: Some(Namespace("<redacted>")) }))) peer_id=<same as previous redacted>
Apr 21 20:35:05 ip-10-0-1-103 drosera-seed-node[42328]: 2025-04-21T20:35:05.775727Z  WARN Swarm::poll: libp2p_rendezvous::server: Failed to send response: DiscoverResponse(Ok(([Registration { namespace: Namespace("<redacted>"), record: PeerRecord { peer_id: PeerId("<redacted>"), seq: 1745267704, addresses: [<redacted>], envelope: SignedEnvelope { key: PublicKey { publickey: Ed25519(PublicKey(compressed): 83ed6a1568123621ae8d21d84a9f63c23fb31872a47b53660ef8c8682f8ed31) }, payload_type: [47, 108, 105, 98, 112, 50, 112, 47, 114, 111, 117, 116, 105, 110, 103, 45, 115, 116, 97, 116, 101, 45, 114, 101, 99, 111, 114, 100], payload: [10, 38, 0, 36, 8, 1, 18, 32, 131, 237, 106, 21, 104, 18, 54, 33, 174, 141, 33, 216, 74, 159, 99, 194, 63, 179, 24, 7, 42, 71, 181, 54, 96, 239, 140, 134, 130, 248, 237, 49, 16, 248, 215, 154, 192, 6, 26, 10, 10, 8, 4, 167, 86, 80, 63, 6, 122, 81], signature: [6, 199, 249, 112, 29, 188, 211, 73, 96, 55, 142, 133, 204, 49, 187, 68, 237, 66, 216, 79, 248, 74, 169, 151, 51, 34, 73, 8, 28, 21, 101, 81, 194, 165, 165, 204, 226, 134, 123, 0, 143, 162, 156, 30, 251, 153, 248, 218, 8, 197, 175, 60, 238, 214, 10, 182, 160, 65, 44, 133, 228, 166, 6, 3] } }, ttl: 7200 }], Cookie { id: 6970538189100493197, namespace: Some(Namespace("<redacted>")) }))) peer_id=<redacted>

I redacted a couple of things, but it's all the same peer and the same IP.

Here is the change we made as a PR: #6002

Now we are no longer panicking with these changes

mergify bot pushed a commit that referenced this issue May 1, 2025
Log error instead of panicking when sending response to channel fails

Related: #5997

Pull-Request: #6002.