
Bug: rendezvous server panics in response to libp2p_request_response::Event::Message #5997


Open · scglenn opened this issue Apr 19, 2025 · 8 comments


scglenn commented Apr 19, 2025

Summary

We are currently running a rendezvous server that helps nodes bootstrap into a set of peers under a topic.

The rendezvous server handled up to about 1,000 connections with no issue, but we have now seen it panic from within the libp2p-rendezvous crate.

The number of connections has increased to about 3k-5k, and this issue started to show up. We removed the pending-connection and established-connection limits and then started to see this behavior.

thread 'tokio-runtime-worker' panicked at /home/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/libp2p-rendezvous-0.14.0/src/server.rs:184:38:

This is where the panic occurs:

.expect("Send response");


I'm unsure if the issue is related to the number of connections, but that's the only thing that has changed.

The comments suggest that self.inner.send_response will bubble up an error:
> If the [ResponseChannel] is already closed due to a timeout or the connection being closed, the response is returned as an Err for further handling.
https://github.com/libp2p/rust-libp2p/blob/master/protocols/request-response/src/lib.rs#L484


Questions:

  • Is it correct behavior for self.inner.send_response to panic if a connection timed out or was closed?
  • How are we supposed to handle this?
  • Any ideas as to why this could be happening?

Additional Note:

  • I would expect the system to be able to handle many more connections than this, given that I have seen others share data showing nodes handling 10k-40k connections.

Expected behavior

I would expect the response to fail for that specific connection and the rendezvous server to continue handling all the other connections rather than crashing.

Actual behavior

It panics frequently, after a couple of minutes of handling connections.

Relevant log output

`thread 'tokio-runtime-worker' panicked at /home/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/libp2p-rendezvous-0.14.0/src/server.rs:184:38:`

Possible Solution

Perhaps the error should just be handled in a way that doesn't panic the entire swarm?
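For illustration, here is a rough sketch of the kind of non-panicking handling I have in mind, written against the public request-response API rather than the actual server.rs internals (the function name, the trait bounds, and the use of the tracing crate for logging are my own placeholders, not libp2p's code):

```rust
use libp2p::request_response::{self, Codec, ResponseChannel};

/// Sketch only: reply to an inbound request and treat a closed
/// ResponseChannel as non-fatal. As the docs quoted above say, `send_response`
/// hands the response back as `Err` when the channel was already closed by a
/// timeout or a dropped connection, so we log it and keep the swarm running.
fn answer_request<C>(
    behaviour: &mut request_response::Behaviour<C>,
    channel: ResponseChannel<C::Response>,
    response: C::Response,
) where
    C: Codec + Clone + Send + 'static,
    C::Response: std::fmt::Debug,
{
    if let Err(unsent) = behaviour.send_response(channel, response) {
        // The peer is already gone; drop the response instead of crashing.
        tracing::warn!(?unsent, "response channel closed before we could reply");
    }
}
```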

Version

libp2p = "0.53.2"

Would you like to work on fixing this bug?

Maybe


scglenn commented Apr 19, 2025

Additionally, I saw an interesting note from a previous issue:
"However, it is true that the peer could close the connection before it has a chance to respond to the request, though in this case it happened after the request is received while polling the request-response behaviour."

#5570 (comment)

@dariusc93 Perhaps this is similar?

dariusc93 (Member) commented:

> Additionally, I saw an interesting note from a previous issue: "However, it is true that the peer could close the connection before it has a chance to respond to the request, though in this case it happened after the request is received while polling the request-response behaviour."
>
> #5570 (comment)
>
> @dariusc93 Perhaps this is similar?

It's very possible. Could you provide any logs and your swarm configuration?


scglenn commented Apr 20, 2025

p2psetup.txt

Perhaps this helps @dariusc93

The logs I have at this moment don't show much more insight beyond the panic itself; I can enable debug and see if there is anything notable.

Note that I did test a deployment where I set the connection limits to the default, which I believe is effectively a max u32, so I don't think that is causing the issue.

Additionally, does it make sense for a panic to occur here? I would expect connections to potentially expire or close at any time, but I wouldn't expect that to crash the swarm.
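For reference, a minimal sketch of how explicit limits are expressed with the connection-limits behaviour (the numbers are made up rather than taken from p2psetup.txt, and it assumes the libp2p connection-limits feature is enabled):

```rust
use libp2p::connection_limits::{self, ConnectionLimits};

fn main() {
    // ConnectionLimits::default() leaves every limit unset, so nothing is
    // capped unless a limit is opted into explicitly. The numbers below are
    // illustrative only.
    let limits = ConnectionLimits::default()
        .with_max_pending_incoming(Some(256))
        .with_max_established_incoming(Some(5_000))
        .with_max_established(Some(10_000));

    // The resulting behaviour then goes into the swarm's NetworkBehaviour struct.
    let _limits_behaviour = connection_limits::Behaviour::new(limits);
}
```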

dariusc93 (Member) commented:

> p2psetup.txt
>
> Perhaps this helps @dariusc93
>
> The logs I have at this moment don't show much more insight beyond the panic itself; I can enable debug and see if there is anything notable.
>
> Note that I did test a deployment where I set the connection limits to the default, which I believe is effectively a max u32, so I don't think that is causing the issue.

Thanks. Though it should not be the issue, you might want to move the connection-limits behaviour to the top so it is handled first when the behaviours are polled by the swarm; polling it later in the tree can cause an inconsistent state if the connection is later denied while other behaviours have already accepted it. See #4870. Besides that, I don't see an exact cause in the code you provided, so it may just be some inconsistency in the state of the behaviours.
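Something along these lines, with the connection-limits behaviour as the first field (the field names and the extra behaviours are placeholders, not your actual setup from p2psetup.txt):

```rust
use libp2p::{connection_limits, identify, ping, rendezvous, swarm::NetworkBehaviour};

// The derived NetworkBehaviour delegates to its fields in declaration order,
// so the limits behaviour gets a chance to deny a connection before the other
// behaviours have accepted it. Assumes the libp2p "macros" feature for the derive.
#[derive(NetworkBehaviour)]
struct ServerBehaviour {
    limits: connection_limits::Behaviour,
    rendezvous: rendezvous::server::Behaviour,
    identify: identify::Behaviour,
    ping: ping::Behaviour,
}
```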


scglenn commented Apr 20, 2025

> Thanks. Though it should not be the issue, you might want to move the connection-limits behaviour to the top so it is handled first when the behaviours are polled by the swarm; polling it later in the tree can cause an inconsistent state if the connection is later denied while other behaviours have already accepted it. See #4870. Besides that, I don't see an exact cause in the code you provided, so it may just be some inconsistency in the state of the behaviours.

Perhaps I should move the rendezvous server behaviour to the top, with the connection-limits behaviour above it, so that rendezvous server events get handled first?


scglenn commented Apr 21, 2025

Moving the connection-limits and rendezvous server behaviours to the top has not fixed the panic so far.

dariusc93 (Member) commented:

Could you provide a log with RUST_LOG=debug used for when it panics?
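Note that RUST_LOG only takes effect if the binary installs an env-filter-aware subscriber; a minimal sketch, assuming the tracing-subscriber crate with its env-filter feature:

```rust
use tracing_subscriber::EnvFilter;

fn main() {
    // Honour RUST_LOG (e.g. RUST_LOG=debug, or something narrower such as
    // RUST_LOG=libp2p_rendezvous=debug,libp2p_swarm=debug) before starting
    // the swarm, so the debug output actually gets emitted.
    tracing_subscriber::fmt()
        .with_env_filter(EnvFilter::from_default_env())
        .init();

    // ... build and run the rendezvous server as before ...
}
```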


scglenn commented Apr 21, 2025

> Could you provide a log with RUST_LOG=debug used for when it panics?

From the panic before we forked libp2p

Apr 17 00:48:22 ip-10-0-1-103 node[4634]: thread 'tokio-runtime-worker' panicked at /home/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/libp2p-rendezvous-0.14.0/src/server.rs:184:38:
Apr 17 00:48:22 ip-10-0-1-103 node[4634]: Send response: DiscoverResponse(Ok(([Registration { namespace: Namespace(<redacted>), record: PeerRecord { peer_id: PeerId(<redacted>), seq: 1744850900, addresses: [<redacted>], envelope: SignedEnvelope { key: PublicKey { publickey: Ed25519(PublicKey(compressed): e153a5b4ffd298f7c934dbc11c91e5b6e97e1ad6984117a6d147d4a644a7b7a1) }, payload_type: [47, 108, 105, 98, 112, 50, 112, 47, 114, 111, 117, 116, 105, 110, 103, 45, 115, 116, 97, 116, 101, 45, 114, 101, 99, 111, 114, 100], payload: [10, 38, 0, 36, 8, 1, 18, 32, 225, 83, 165, 180, 255, 210, 152, 247, 201, 52, 219, 193, 28, 145, 229, 182, 233, 126, 26, 214, 152, 65, 23, 166, 209, 71, 212, 166, 68, 167, 183, 161, 16, 212, 159, 129, 192, 6, 26, 10, 10, 8, 4, 62, 169, 24, 179, 6, 122, 81], signature: [168, 181, 223, 55, 194, 86, 7, 237, 139, 113, 177, 178, 114, 139, 221, 17, 56, 243, 111, 91, 190, 249, 44, 38, 128, 146, 6, 61, 50, 236, 64, 221, 174, 142, 131, 40, 183, 86, 182, 102, 248, 208, 93, 21, 137, 51, 50, 77, 254, 110, 129, 105, 252, 42, 36, 77, 174, 241, 98, 215, 114, 56, 15, 13] } }, ttl: 7200 }], Cookie { id: 4808417751412047776, namespace: Some(Namespace(<redacted>)) })))

We updated the rendezvous logic to not panic if a connection dropped, and now we see this (we added our own log that states "Failed to send response" rather than panicking):

$ sudo journalctl -u service --since "2025-04-21 20:09:00" -f | grep -i "failed to send response"
Apr 21 20:30:05 ip-10-0-1-103 drosera-seed-node[42328]: 2025-04-21T20:30:05.712449Z  WARN Swarm::poll: libp2p_rendezvous::server: Failed to send response: DiscoverResponse(Ok(([Registration { namespace: Namespace("<redacted>"), record: PeerRecord { peer_id: PeerId("<redacted>"), seq: 1745267404, addresses: [<redacted>], envelope: SignedEnvelope { key: PublicKey { publickey: Ed25519(PublicKey(compressed): 83ed6a1568123621ae8d21d84a9f63c23fb31872a47b53660ef8c8682f8ed31) }, payload_type: [47, 108, 105, 98, 112, 50, 112, 47, 114, 111, 117, 116, 105, 110, 103, 45, 115, 116, 97, 116, 101, 45, 114, 101, 99, 111, 114, 100], payload: [10, 38, 0, 36, 8, 1, 18, 32, 131, 237, 106, 21, 104, 18, 54, 33, 174, 141, 33, 216, 74, 159, 99, 194, 63, 179, 24, 7, 42, 71, 181, 54, 96, 239, 140, 134, 130, 248, 237, 49, 16, 204, 213, 154, 192, 6, 26, 10, 10, 8, 4, 167, 86, 80, 63, 6, 122, 81], signature: [179, 153, 53, 19, 123, 37, 166, 35, 71, 14, 5, 160, 71, 9, 163, 78, 109, 152, 252, 79, 9, 82, 162, 7, 24, 52, 21, 242, 51, 55, 6, 127, 62, 254, 152, 205, 97, 170, 69, 28, 16, 37, 55, 34, 98, 76, 73, 107, 91, 240, 32, 245, 148, 55, 251, 229, 231, 241, 94, 0, 122, 249, 186, 3] } }, ttl: 7200 }], Cookie { id: 8618633838550913715, namespace: Some(Namespace("<redacted>")) }))) peer_id=<same as previous redacted>
Apr 21 20:35:05 ip-10-0-1-103 drosera-seed-node[42328]: 2025-04-21T20:35:05.775727Z  WARN Swarm::poll: libp2p_rendezvous::server: Failed to send response: DiscoverResponse(Ok(([Registration { namespace: Namespace("<redacted>"), record: PeerRecord { peer_id: PeerId("<redacted>"), seq: 1745267704, addresses: [<redacted>], envelope: SignedEnvelope { key: PublicKey { publickey: Ed25519(PublicKey(compressed): 83ed6a1568123621ae8d21d84a9f63c23fb31872a47b53660ef8c8682f8ed31) }, payload_type: [47, 108, 105, 98, 112, 50, 112, 47, 114, 111, 117, 116, 105, 110, 103, 45, 115, 116, 97, 116, 101, 45, 114, 101, 99, 111, 114, 100], payload: [10, 38, 0, 36, 8, 1, 18, 32, 131, 237, 106, 21, 104, 18, 54, 33, 174, 141, 33, 216, 74, 159, 99, 194, 63, 179, 24, 7, 42, 71, 181, 54, 96, 239, 140, 134, 130, 248, 237, 49, 16, 248, 215, 154, 192, 6, 26, 10, 10, 8, 4, 167, 86, 80, 63, 6, 122, 81], signature: [6, 199, 249, 112, 29, 188, 211, 73, 96, 55, 142, 133, 204, 49, 187, 68, 237, 66, 216, 79, 248, 74, 169, 151, 51, 34, 73, 8, 28, 21, 101, 81, 194, 165, 165, 204, 226, 134, 123, 0, 143, 162, 156, 30, 251, 153, 248, 218, 8, 197, 175, 60, 238, 214, 10, 182, 160, 65, 44, 133, 228, 166, 6, 3] } }, ttl: 7200 }], Cookie { id: 6970538189100493197, namespace: Some(Namespace("<redacted>")) }))) peer_id=<redacted>

I redacted a couple of things, but it's all the same peer and the same IP.

Here is the change we made as a PR: #6002

Now we are no longer panicking with these changes

mergify bot pushed a commit that referenced this issue May 1, 2025
Log error instead of panicking when sending response to channel fails

Related: #5997

Pull-Request: #6002.