Substream limits exceeded #1504
It would be tremendously helpful if you could provide some sort of test case that makes it possible to reproduce your problems. Could you also provide the specific errors you see or some logs? And about the panic: do you see this with both multiplexer implementations or only with mplex?
Hi, thank you for the comment, but unfortunately I can't share the project right now, and extracting a PoC would take a lot of time. I thought that, given that there don't seem to be any projects that use the get providers feature explicitly (maybe I am wrong, but I can't find any on GitHub or in the list of notable users) and that there are some quickly detectable inconveniences (#1507, #1526), the problem here might also be easily discoverable... With default libp2p at debug log level I see that the panic appears only with mplex, and I now believe it isn't linked to reaching the maximum substream count.
Issues libp2p#1504 and libp2p#1523 reported panics caused by polling the sink of `secio::LenPrefixCodec` after it had entered its terminal state, i.e. after it had previously encountered an error or was closed. According to the reports this happened only when using mplex as a stream multiplexer. It seems that because mplex always stores and keeps the `Waker` when polling, a wakeup of any of those wakers will resume the polling even for those cases where the previous poll did not return `Poll::Pending` but resolved to a value. To prevent polling after the connection was closed or an error happened we check for those conditions prior to every poll.
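To illustrate the pattern described above, here is a simplified, hypothetical sketch (not the actual mplex code): the connection remembers its terminal state and checks it before every poll of the underlying codec, so a late wake-up from a stored `Waker` can no longer poll the codec after it has already resolved.

```rust
use std::io;
use std::pin::Pin;
use std::task::{Context, Poll};

use futures::{ready, Stream};

/// Illustrative terminal states; the names are not taken from mplex.
enum Status {
    Open,
    Closed,
    Errored,
}

/// Hypothetical wrapper around a length-prefixed codec such as
/// `secio::LenPrefixCodec`.
struct Connection<C> {
    codec: C,
    status: Status,
}

impl<C> Connection<C>
where
    C: Stream<Item = Result<Vec<u8>, io::Error>> + Unpin,
{
    fn poll_recv(&mut self, cx: &mut Context<'_>) -> Poll<Result<Vec<u8>, io::Error>> {
        // The essence of the fix: check the terminal state *before* polling
        // the codec, so a spurious wake-up cannot poll it after completion.
        match self.status {
            Status::Closed => {
                return Poll::Ready(Err(io::Error::new(io::ErrorKind::Other, "connection is closed")))
            }
            Status::Errored => {
                return Poll::Ready(Err(io::Error::new(io::ErrorKind::Other, "connection has errored")))
            }
            Status::Open => {}
        }
        match ready!(Pin::new(&mut self.codec).poll_next(cx)) {
            Some(Ok(frame)) => Poll::Ready(Ok(frame)),
            Some(Err(e)) => {
                // Remember the error so that subsequent polls short-circuit.
                self.status = Status::Errored;
                Poll::Ready(Err(e))
            }
            None => {
                self.status = Status::Closed;
                Poll::Ready(Err(io::Error::new(io::ErrorKind::UnexpectedEof, "stream ended")))
            }
        }
    }
}
```

The actual fix described above applies this kind of check prior to every poll of the connection, including the send path where the reported panic occurred.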
@michaelvoronov: Would you mind testing #1529 to see if the panic is gone?
@twittner thanks, will try to do so today or tomorrow.
* mplex: Check for error and shutdown. Issues #1504 and #1523 reported panics caused by polling the sink of `secio::LenPrefixCodec` after it had entered its terminal state, i.e. after it had previously encountered an error or was closed. According to the reports this happened only when using mplex as a stream multiplexer. It seems that because mplex always stores and keeps the `Waker` when polling, a wakeup of any of those wakers will resume the polling even for those cases where the previous poll did not return `Poll::Pending` but resolved to a value. To prevent polling after the connection was closed or an error happened we check for those conditions prior to every poll.
* Keep error when operations fail.

Co-authored-by: Pierre Krieger <[email protected]>
@twittner I've run this complex test several times with the fix and there weren't any panics on mplex.
I guess this is resolved? 🎉
I have stumbled upon the problem of reaching the maximum number of substreams inside the multiplexers.

I have a complex test and will try to describe it: from 20 nodes using libp2p-kad for connectivity I constructed 176 pairs (with `tuple_combinations`), then created 176 clients connected to these nodes. In each pair one client sends some info to the other client. Each node has two separate services based on different libp2p swarms: one for clients, the other for nodes in the network. The node swarm behaviour combines kademlia, identify, ping and my own protocol. My own protocol is pretty simple and is based on the `OneShotHandler`. The peer swarm just controls clients and forwards requests from them to the node swarm. Clients are separate entities and are also built on rust-libp2p.

After receiving a request from a client, the node swarm uses the `get providers` capability of libp2p-kad to find the other client (the second one in the corresponding pair). After that it tries to connect to it via my own protocol and send a short message. So the flow looks like this: `client_1` -> `node_1` {`client swarm` -> `node swarm` (FindProviders-Connect-Send)} -> `node_2` {`node swarm` -> `client swarm`} -> `client_2`.
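To make the flow above more concrete, here is a purely illustrative sketch. `NodeSwarm`, `ProviderKey`, `PeerId` and the async helpers are hypothetical stand-ins, not the actual rust-libp2p API or the project's code; they only mirror the FindProviders-Connect-Send sequence:

```rust
use std::io;

/// Hypothetical stand-in for the node swarm; the real thing would wrap a
/// libp2p `Swarm` combining kademlia, identify, ping and the custom protocol.
struct NodeSwarm;

/// Hypothetical key and peer id types.
struct ProviderKey(Vec<u8>);
struct PeerId(String);

impl NodeSwarm {
    /// Stand-in for issuing a kademlia get providers query and awaiting its result.
    async fn find_providers(&mut self, _key: &ProviderKey) -> io::Result<Vec<PeerId>> {
        Ok(Vec::new())
    }
    /// Stand-in for dialing the peer.
    async fn connect(&mut self, _peer: &PeerId) -> io::Result<()> {
        Ok(())
    }
    /// Stand-in for the custom `OneShotHandler`-based protocol sending one message.
    async fn send_oneshot(&mut self, _peer: &PeerId, _payload: Vec<u8>) -> io::Result<()> {
        Ok(())
    }

    /// The FindProviders -> Connect -> Send sequence described above.
    async fn forward(&mut self, key: ProviderKey, payload: Vec<u8>) -> io::Result<()> {
        // 1. FindProviders: ask libp2p-kad which peer provides the target client's key.
        let providers = self.find_providers(&key).await?;
        let peer = providers
            .into_iter()
            .next()
            .ok_or_else(|| io::Error::new(io::ErrorKind::NotFound, "no providers"))?;
        // 2. Connect to that peer (the node of the pair's second client).
        self.connect(&peer).await?;
        // 3. Send the short message over the custom one-shot protocol.
        self.send_oneshot(&peer, payload).await
    }
}
```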
The first client from each pair sends 100 messages to the second client. All pairs are placed in a `FuturesUnordered`, so sending between pairs is parallel but inside one pair it is sequential: the test waits for the second client to receive each message, with a timeout of 5 seconds.

All of this works fine (without any timeouts) on my 176 pairs only if I increase the maximum number of substreams on the multiplexers (both mplex and yamux). But after several runs it also gets stuck, apparently after reaching the configured maximum of substreams or something like that; I have seen several corresponding errors about reaching the substream limit from mplex in the logs. For example, if I increase the substream count from the default 128 for mplex to 1024, the test completes successfully the first time. Also, from the logs it can be seen that the message is successfully passed to the node swarm of the `node_1` connected to `client_1`.
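For concreteness, a rough sketch of how such a test harness might be wired up, assuming `itertools`' `tuple_combinations`, `FuturesUnordered` and an `async-std` style timeout; `Client` and its methods are placeholders, not the project's actual client code:

```rust
use std::time::Duration;

use futures::stream::{FuturesUnordered, StreamExt};
use itertools::Itertools;

/// Placeholder for the real client, which is itself built on rust-libp2p.
#[derive(Clone)]
struct Client;

impl Client {
    /// Stand-in: send one message through the node's client swarm.
    async fn send(&self, _msg: String) {}
    /// Stand-in: resolve once this client has received a message.
    async fn wait_for_message(&self) {}
}

/// One pair: 100 messages, sent sequentially, each awaited with a 5 s timeout.
async fn run_pair(sender: Client, receiver: Client) {
    for i in 0..100 {
        sender.send(format!("msg-{}", i)).await;
        async_std::future::timeout(Duration::from_secs(5), receiver.wait_for_message())
            .await
            .expect("second client did not receive the message within 5 s");
    }
}

/// All pairs (built with `tuple_combinations`) run in parallel via `FuturesUnordered`.
async fn run_test(clients: Vec<Client>) {
    let mut pairs: FuturesUnordered<_> = clients
        .into_iter()
        .tuple_combinations::<(_, _)>()
        .map(|(a, b)| run_pair(a, b))
        .collect();

    while pairs.next().await.is_some() {}
}
```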
I have instrumented mplex a little and it seems that finding providers creates several new substreams (5-6 on my network), but only one of them gets closed.
All of this looks like the root cause is in the stream management on the libp2p-kad side, but after a very quick analysis I couldn't find it.
Also, very rarely, I get a panic like this: