
Deadlock issue in futures/tokio? #1170


Closed
StanislavGlebik opened this issue Aug 3, 2018 · 10 comments

@StanislavGlebik

Flagging an issue that we've run into recently. We have a binary that waits for a future to complete via Runtime::block_on() - https://goo.gl/tmVfCU - and from time to time it waits forever. From the logs it was clear that the future had actually completed, but the Runtime::block_on() method never returned.

Below is the stack trace. Interestingly, there was just one thread running when we noticed the deadlock.

#0  0x00007f1039e0180f in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/local/fbcode/gcc-5-glibc-2.23/lib/libpthread.so.0

#1  0x000055af8bb52c0b in wait (self=<optimized out>, mutex=<optimized out>) at /home/engshare/third-party2/rust/1.27.0/src/rust/src/libstd/sys/unix/condvar.rs:78

#2  wait (self=0x7f0fbb36c070, mutex=0x7f0fbb36c040) at /home/engshare/third-party2/rust/1.27.0/src/rust/src/libstd/sys_common/condvar.rs:51

#3  wait<()> (self=0x7f0fbb36d6e8, guard=...) at /home/engshare/third-party2/rust/1.27.0/src/rust/src/libstd/sync/condvar.rs:214

#4  futures::task_impl::std::ThreadNotify::park::h0aa67b83d2364aa3 (self=0x7f0fbb36d6d0) at vendor/futures/src/task_impl/std/mod.rs:536

#5  0x000055af8b658f8f in {{closure}}<futures::sync::oneshot::Receiver<core::result::Result<(), failure::error::Error>>> (notify=<optimized out>)
    at /home/engshare/third-party2/rust-crates-io/0.0.0/src/build-gcc-5-glibc-2.23/build/crates.io/vendor/futures/src/task_impl/std/mod.rs:244

#6  {{closure}}<closure,core::result::Result<core::result::Result<(), failure::error::Error>, futures::sync::oneshot::Canceled>> (notify=<optimized out>)
    at /home/engshare/third-party2/rust-crates-io/0.0.0/src/build-gcc-5-glibc-2.23/build/crates.io/vendor/futures/src/task_impl/std/mod.rs:506

#7  try_with<alloc::arc::Arc<futures::task_impl::std::ThreadNotify>,closure,core::result::Result<core::result::Result<(), failure::error::Error>, futures::sync::oneshot::Canceled>> (
    self=<optimized out>, f=...) at /home/engshare/third-party2/rust/1.27.0/src/rust/src/libstd/thread/local.rs:294

#8  with<alloc::arc::Arc<futures::task_impl::std::ThreadNotify>,closure,core::result::Result<core::result::Result<(), failure::error::Error>, futures::sync::oneshot::Canceled>> (self=<optimized out>,
    f=...) at /home/engshare/third-party2/rust/1.27.0/src/rust/src/libstd/thread/local.rs:248

#9  with_current<closure,core::result::Result<core::result::Result<(), failure::error::Error>, futures::sync::oneshot::Canceled>> (f=...)
    at /home/engshare/third-party2/rust-crates-io/0.0.0/src/build-gcc-5-glibc-2.23/build/crates.io/vendor/futures/src/task_impl/std/mod.rs:506

#10 wait_future<futures::sync::oneshot::Receiver<core::result::Result<(), failure::error::Error>>> (self=<optimized out>)
    at /home/engshare/third-party2/rust-crates-io/0.0.0/src/build-gcc-5-glibc-2.23/build/crates.io/vendor/futures/src/task_impl/std/mod.rs:240

#11 wait<futures::sync::oneshot::Receiver<core::result::Result<(), failure::error::Error>>> (self=...)
    at /home/engshare/third-party2/rust-crates-io/0.0.0/src/build-gcc-5-glibc-2.23/build/crates.io/vendor/futures/src/future/mod.rs:299

#12 block_on<futures::future::map_err::MapErr<alloc::boxed::Box<Future>, closure>,(),failure::error::Error> (self=<optimized out>, future=...)
    at /home/engshare/third-party2/rust-crates-io/0.0.0/src/build-gcc-5-glibc-2.23/build/crates.io/vendor/tokio/src/runtime/mod.rs:389

#13 blobimport::main::h79db5a5e920deff5 () at scm/mononoke/cmds/blobimport.rs:102

#14 0x000055af8b6530c7 in std::rt::lang_start::_$u7b$$u7b$closure$u7d$$u7d$::h0a39116bae72204c () at /home/engshare/third-party2/rust/1.27.0/src/rust/src/libstd/rt.rs:74

#15 0x000055af8bbe2503 in {{closure}} () at libstd/rt.rs:59

#16 std::panicking::try::do_call::h08033a34fd60edd4 (data=0x7ffc058e2e50 "\200.\216\005\374\177") at libstd/panicking.rs:310

#17 0x000055af8bbec9fa in __rust_maybe_catch_panic (f=0xfffffffffffffe00, data=0x7f0fbb36c074 "\001", data_ptr=0x7ffc058e2e78, vtable_ptr=0x7ffc058e2e70) at libpanic_unwind/lib.rs:105

#18 0x000055af8bbe2d3d in try<i32,closure> (f=...) at libstd/panicking.rs:289

#19 catch_unwind<closure,i32> (f=...) at libstd/panic.rs:374

#20 std::rt::lang_start_internal::h18d8fb7455b57bc0 (main=..., argc=18, argv=0x7ffc058e3068) at libstd/rt.rs:58

#21 0x000055af8b65952a in main ()

Has anybody run into similar issues?

@cramertj
Member

cramertj commented Aug 3, 2018

Interesting! I think this belongs as an issue on https://github.com/tokio-rs/tokio, though, since the block_on function here is from tokio's reactor, not the one in the futures repo.

@jsgf

jsgf commented Aug 3, 2018

@cramertj From the stack trace, it's blocked waiting for the Receiver side of a sync::oneshot channel, which is in futures, and it seems to be futures all the way down to the unwakeable condvar wait.

@cramertj
Member

cramertj commented Aug 3, 2018

@jsgf If you suspect there's a bug in sync::oneshot or another futures utility, I'm happy to take a look. Are you seeing suspicious behavior coming from one of the futures utilities involved? Blocking on the condition variable seems like a totally reasonable thing to be doing if no value has arrived on the other side.

@jsgf

jsgf commented Aug 3, 2018

In this case the code is from tokio::runtime::block_on:

        let (tx, rx) = futures::sync::oneshot::channel();
        self.spawn(future.then(move |r| tx.send(r).map_err(|_| unreachable!())));
        rx.wait().unwrap() // <- blocked here

We're blocked in pthread_cond_wait as part of rx.wait() - but no other threads exist at all, so there's nothing to wake it up.

What are the threadpool and reactor thread lifetimes? Would they exit if there's nothing that needs them?

That means that in the self.spawn, one of the following happened:

  1. the future just deadlocked, and stopped needing either an executor thread or an IO reactor thread;
  2. it finished and the tx.send() happened, but the rx.wait() didn't wake up; or
  3. tx.send() got an error, hit unreachable!(), and that got eaten somewhere. That would explain all the missing threads - but there's no way it could fail, since the rx side is still there.

So my initial reaction is that this looks like a lost wakeup - the code spawned into the executor completed normally, but the rx.wait() somehow lost the wakeup from the value being returned.
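To show what a "not lost" wakeup has to look like, here is a minimal std-only sketch of the park/notify handoff in the spirit of futures' ThreadNotify: the completion flag and the condvar are coupled under one mutex, which is exactly what closes the race window. (A hypothetical illustration, not the actual futures/tokio code; all names are invented.)

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

// Waiter side: re-check the flag with the lock held before every
// wait, so a notify() racing with park() cannot be lost.
fn wait_for_completion(state: &(Mutex<bool>, Condvar)) {
    let (lock, cvar) = state;
    let mut done = lock.lock().unwrap();
    while !*done {
        done = cvar.wait(done).unwrap();
    }
}

// Notifier side: set the flag under the same lock, then signal.
fn complete(state: &(Mutex<bool>, Condvar)) {
    let (lock, cvar) = state;
    *lock.lock().unwrap() = true;
    cvar.notify_one();
}

fn main() {
    let state = Arc::new((Mutex::new(false), Condvar::new()));
    let notifier = Arc::clone(&state);
    let t = thread::spawn(move || complete(&notifier));
    wait_for_completion(&state);
    t.join().unwrap();
    println!("woke up");
}
```

A lost wakeup means some step of this discipline was violated: the flag was set, but the signal fired while the waiter was between checking the flag and going to sleep, without the lock serializing the two.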

I attached to the process with gdb and tried to poke around to see if I could find the oneshot::Inner and see what state it was in, but I managed to kill the process before getting anywhere (the lack of debug info made it very hard to work out where anything was: rust-lang/rust#53036).

It could be a bug in oneshot, but I wasn't going to rule out a general scheduling problem since I don't really know how all that works.

@sfackler
Member

sfackler commented Aug 3, 2018

I've been seeing something potentially related recently as well.

We have a synchronous HTTP client built on top of tokio/hyper. There's a tokio Runtime running in the background, and we spawn off HTTP requests with mpsc::spawn and then .wait() on them to provide the blocking interface.

Recently we've seen the client getting into a bad state where all requests block forever. The synchronous bit is waiting on the oneshot channel in pthread_cond_wait, and the runtime threads are idle. I'm going to try to make a smaller reproduction this weekend.
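For context, the shape of that synchronous-over-async pattern looks roughly like this, sketched with std channels only (hypothetical names; the real client uses tokio/hyper and futures' oneshot, and the worker is a tokio Runtime rather than a plain thread):

```rust
use std::sync::mpsc;
use std::thread;

// Synchronous request API backed by a background worker. The caller
// sends the request plus a reply channel, then blocks on recv() -
// the std analogue of rx.wait() above. If the worker never replies,
// this call hangs forever, which is the failure mode described here.
fn blocking_request(
    req: &str,
    worker_tx: &mpsc::Sender<(String, mpsc::Sender<String>)>,
) -> String {
    let (reply_tx, reply_rx) = mpsc::channel();
    worker_tx.send((req.to_string(), reply_tx)).unwrap();
    reply_rx.recv().unwrap()
}

fn main() {
    let (worker_tx, worker_rx) = mpsc::channel::<(String, mpsc::Sender<String>)>();
    // Background "runtime": service each request, send a reply back.
    thread::spawn(move || {
        for (req, reply) in worker_rx {
            reply.send(format!("response to {}", req)).unwrap();
        }
    });
    println!("{}", blocking_request("GET /", &worker_tx));
}
```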

My gut says that this is probably a tokio issue rather than a futures one, though, yeah.

@StanislavGlebik
Author

We just ran into the issue again. I can collect any necessary info for debugging.

@StanislavGlebik
Author

Frame #4 of the stack trace above is in the ThreadNotify::park() method. So in gdb I did

        f 4
        p self.state

and it showed that self.state is 2, which, per the source (https://github.com/rust-lang-nursery/futures-rs/blob/0.1.23/src/task_impl/std/mod.rs#L492), is the SLEEP state. So I assume that notify() wasn't called.
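For anyone following along, the state machine behind that value can be sketched with std atomics alone. As I read the linked source, the constants are IDLE = 0, NOTIFY = 1 and SLEEP = 2, so a state stuck at 2 means the waiter parked and the condvar signal never came. (A hypothetical, simplified sketch of the protocol - not the real ThreadNotify, which couples this atomic with a mutex and condvar.)

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// States assumed from the linked source: IDLE = nothing pending,
// NOTIFY = wakeup pending, SLEEP = waiter blocked on the condvar.
const IDLE: usize = 0;
const NOTIFY: usize = 1;
const SLEEP: usize = 2;

struct ParkState(AtomicUsize);

impl ParkState {
    fn new() -> Self {
        ParkState(AtomicUsize::new(IDLE))
    }

    // Returns true if the caller must now block on the condvar,
    // false if a pending NOTIFY was consumed instead.
    fn park(&self) -> bool {
        if self
            .0
            .compare_exchange(NOTIFY, IDLE, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
        {
            false // a notification was already pending: don't sleep
        } else {
            self.0.store(SLEEP, Ordering::Release);
            true // go to sleep; a value stuck at 2 means this never ended
        }
    }

    // Returns true if the waiter was in SLEEP, i.e. the notifier must
    // also signal the condvar. A lost wakeup is exactly that signal
    // never being delivered while the state sits at SLEEP (2).
    fn notify(&self) -> bool {
        self.0.swap(NOTIFY, Ordering::AcqRel) == SLEEP
    }
}

fn main() {
    let s = ParkState::new();
    assert!(s.park()); // IDLE -> SLEEP: would block
    assert!(s.notify()); // waiter asleep: condvar signal required
    assert!(!s.park()); // NOTIFY consumed: no sleep needed
    println!("state machine ok");
}
```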

@jsgf

jsgf commented Aug 5, 2018

Hm, it would be really nice to see what state the oneshot itself is in then. Does it think it has notified?

Edit: Well at least SLEEP is completely consistent with being asleep on the condvar, and at first glance all the logic around park/notify looks sound.

@sfackler
Member

sfackler commented Aug 6, 2018

We downgraded to tokio-threadpool 0.1.4 and that seems like it may have fixed things for us - might be worth trying that on your end as well?

@cramertj
Member

Closing as stale
