Description
#186 fixed several race conditions that happen during the "disconnecting and blocking" unit test, but a new one was reported in Sjors/bitcoin#90 (comment), with details in Sjors/bitcoin#90 (comment), and I want to write more notes about it here. There is also a branch with previously attempted fixes.
The problem is not too hard to reproduce locally: if the mptest binary is run in a loop in bash, the tsan error happens within a few minutes. With the attempted fixes so far, running the mptest binary in a loop results in a deadlock after about 20 minutes for me.
Background
If a `Connection` object is destroyed while a `ProxyClient<Thread>` object is still using it (due to a sudden disconnect), the `ProxyClient<Thread>::m_disconnect_cb` callback will be called from the `~Connection` destructor, deleting the `ProxyClient<Thread>` object from whichever `ConnThreads` map it is in.

If a `ProxyClient<Thread>` object is destroyed before its `Connection` object (the more normal case), the `ProxyClient<Thread>` destructor will unregister `m_disconnect_cb`, so when the `Connection` is later destroyed it will not try to delete the `ProxyClient<Thread>` object.

If the two destructors are called in sequence, in either order, everything is fine. But if the destructors are called from different threads in a race, numerous problems occur.
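
To make the two paths concrete, here is a minimal single-threaded model of this arrangement. The class layout and the `addSyncCleanup` counterpart to `removeSyncCleanup` are assumptions based on the description above, not the actual libmultiprocess code:

```cpp
#include <functional>
#include <list>

// Stand-in for Connection: keeps a list of cleanup callbacks to run if the
// connection goes away while clients still reference it.
struct Connection
{
    using CleanupList = std::list<std::function<void()>>;
    using CleanupIt = CleanupList::iterator;

    CleanupIt addSyncCleanup(std::function<void()> fn)
    {
        return m_sync_cleanup_fns.insert(m_sync_cleanup_fns.end(), std::move(fn));
    }
    void removeSyncCleanup(CleanupIt it) { m_sync_cleanup_fns.erase(it); }

    ~Connection()
    {
        // Sudden-disconnect path: run any cleanups still registered. In the
        // real code this is where m_disconnect_cb deletes the
        // ProxyClient<Thread> object from its ConnThreads map.
        while (!m_sync_cleanup_fns.empty()) {
            m_sync_cleanup_fns.front()();
            m_sync_cleanup_fns.pop_front();
        }
    }

    CleanupList m_sync_cleanup_fns;
};

// Stand-in for ProxyClient<Thread>: registers a disconnect callback on
// construction and unregisters it on destruction (the "more normal case").
struct ThreadClient
{
    explicit ThreadClient(Connection& connection) : m_connection(connection)
    {
        m_disconnect_cb = m_connection.addSyncCleanup([] {
            // Real code: erase this ProxyClient<Thread> from its ConnThreads map.
        });
    }
    ~ThreadClient() { m_connection.removeSyncCleanup(m_disconnect_cb); }

    Connection& m_connection;
    Connection::CleanupIt m_disconnect_cb;
};

int main()
{
    Connection connection;
    {
        ThreadClient client{connection};
    } // Normal case: client unregisters its callback here.
}     // Connection destroyed with an empty cleanup list, so nothing runs.
```

With a single thread both destruction orders are fine, as described above; all of the problems below come from running the two destructors concurrently on different threads.
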
Problem 1: invalid `m_disconnect_cb` iterator erase
If `~Connection` runs first and is about to call the `m_disconnect_cb` callback, but the `~ProxyClient<Thread>` destructor is then called in another thread, it will currently (as of a11e690) try to delete `m_disconnect_cb` using an iterator which is no longer valid. 79c09e4 fixes this by changing the `~ProxyClient<Thread>` destructor to avoid deleting `m_disconnect_cb` once it is invalidated.
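
In terms of the model above, the guard needs to look roughly like the following. The `std::optional` wrapping and the shared mutex are assumptions about the shape of the fix, not the actual 79c09e4 diff:

```cpp
#include <functional>
#include <list>
#include <mutex>
#include <optional>

using CleanupList = std::list<std::function<void()>>;

// State shared between ~Connection and ~ProxyClient<Thread> for one
// registered disconnect callback.
struct DisconnectRegistration
{
    std::mutex m_mutex;
    std::optional<CleanupList::iterator> m_disconnect_cb;
};

// ~Connection side: claim the entry under the lock before invoking it, so the
// iterator can no longer be used (or erased) by the other destructor.
std::optional<std::function<void()>> takeCallback(DisconnectRegistration& reg, CleanupList& fns)
{
    std::lock_guard<std::mutex> lock(reg.m_mutex);
    if (!reg.m_disconnect_cb) return std::nullopt;
    std::function<void()> fn = std::move(**reg.m_disconnect_cb);
    fns.erase(*reg.m_disconnect_cb);
    reg.m_disconnect_cb.reset();
    return fn;
}

// ~ProxyClient<Thread> side: only erase the entry if ~Connection has not
// already claimed it and invalidated the iterator.
void unregisterCallback(DisconnectRegistration& reg, CleanupList& fns)
{
    std::lock_guard<std::mutex> lock(reg.m_mutex);
    if (reg.m_disconnect_cb) {
        fns.erase(*reg.m_disconnect_cb);
        reg.m_disconnect_cb.reset();
    }
}
```
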
Problem 2: `ProxyClient<Thread>` use-after-free
The second problem happens if `~Connection` runs first and invokes the `m_disconnect_cb` callback, but the `~ProxyClient<Thread>` destructor runs and finishes before `m_disconnect_cb` has a chance to do anything. When the callback resumes, the `ProxyClient<Thread>` object will be gone, and the callback's attempt to access and erase it will be invalid. 79c09e4 fixes this by making `~ProxyClient<Thread>` set a destroyed bit when it runs and having the `m_disconnect_cb` callback just return early if the bit is set.
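
A sketch of that destroyed-bit pattern, with the flag kept in state shared between the destructor and the callback (the names and placement are assumptions; the point is only that the flag is set and checked under a common lock):

```cpp
#include <mutex>

struct DisconnectState
{
    std::mutex m_mutex;
    bool m_destroyed = false; // set by ~ProxyClient<Thread>
};

// Runs from ~ProxyClient<Thread> before the object goes away.
void markDestroyed(DisconnectState& state)
{
    std::lock_guard<std::mutex> lock(state.m_mutex);
    state.m_destroyed = true;
}

// Body of m_disconnect_cb, running on the event loop thread.
void onDisconnect(DisconnectState& state)
{
    std::lock_guard<std::mutex> lock(state.m_mutex);
    if (state.m_destroyed) return; // ProxyClient<Thread> already gone; nothing to erase
    // ...otherwise erase the ProxyClient<Thread> from its ConnThreads map...
}
```

This only helps while the shared state itself (in the real code, the state guarded by `ThreadContext::waiter.m_mutex`) is still alive, which is exactly what Problem 3 below is about.
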
Problem 3: `ThreadContext` use-after-free
Just fixing the race between `~Connection` and `~ProxyClient<Thread>` is actually not enough, because after the `ProxyClient<Thread>` object is destroyed, the thread owning the `ProxyClient<Thread>` can exit, destroying the `thread_local` `ThreadContext` struct that contained the `ProxyClient<Thread>` object, and also destroying the `ThreadContext::waiter.m_mutex` mutex the `ProxyClient<Thread>` code uses for synchronization. If this happens while an `m_disconnect_cb` callback is still in progress, that callback will fail when it tries to lock a mutex that no longer exists. (Commit 85df929 was a broken attempt to fix this, and commit 30431bc has another description of the problem.)
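
A minimal model of the lifetime hazard; the layout is assumed from the description above, and the `thread_local` placement is the important part:

```cpp
#include <mutex>

struct Waiter
{
    std::mutex m_mutex; // the ThreadContext::waiter.m_mutex from the description
};

struct ThreadContext
{
    Waiter waiter;
    // ...the callback_threads / request_threads maps holding ProxyClient<Thread>
    // objects live here too...
};

// Destroyed automatically when the owning client thread exits.
thread_local ThreadContext g_thread_context;

// Runs on the event loop thread as part of m_disconnect_cb. If the client
// thread has already exited, `waiter` is dangling and this lock is a
// use-after-free: that is the remaining tsan report.
void disconnectCallback(Waiter* waiter)
{
    std::lock_guard<std::mutex> lock(waiter->m_mutex);
    // ...check the destroyed bit, erase the ProxyClient<Thread>, etc...
}
```
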
Possible Fixes
Adding an `EventLoop::sync` call
The most direct fix for problem 3 is probably to have the thread that destroyed the `ProxyClient<Thread>` object, and is about to exit and destroy `ThreadContext::waiter.m_mutex`, call `EventLoop::sync` to force `m_disconnect_cb` to finish running before this happens. This would work because `m_disconnect_cb` always runs on the event loop thread. It would require reverting 85df929 to avoid reintroducing Problem 2 (so the `m_disconnect_cb` callback checks the destroyed bit after it acquires `ThreadContext::waiter.m_mutex` instead of before).
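
Here is a sketch of the proposed ordering. The `EventLoop` below is a small stand-in written only to make the `sync` contract explicit (run a callable on the event loop thread and block until it finishes); it is not the real libmultiprocess `EventLoop`:

```cpp
#include <condition_variable>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>

// Toy event loop: a single worker thread draining a queue of callables.
class EventLoop
{
public:
    EventLoop() : m_thread([this] { run(); }) {}
    ~EventLoop()
    {
        post([this] { m_done = true; });
        m_thread.join();
    }
    // Queue fn to run on the loop thread and return immediately.
    void post(std::function<void()> fn)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_queue.push(std::move(fn));
        m_cv.notify_one();
    }
    // Queue fn and block until the loop thread has finished running it.
    void sync(std::function<void()> fn)
    {
        auto done = std::make_shared<std::promise<void>>();
        std::future<void> finished = done->get_future();
        post([done, fn = std::move(fn)] {
            fn();
            done->set_value();
        });
        finished.wait();
    }

private:
    void run()
    {
        while (!m_done) {
            std::function<void()> fn;
            {
                std::unique_lock<std::mutex> lock(m_mutex);
                m_cv.wait(lock, [this] { return !m_queue.empty(); });
                fn = std::move(m_queue.front());
                m_queue.pop();
            }
            fn();
        }
    }
    std::mutex m_mutex;
    std::condition_variable m_cv;
    std::queue<std::function<void()>> m_queue;
    bool m_done = false;
    std::thread m_thread;
};

int main()
{
    std::mutex waiter_mutex; // stands in for ThreadContext::waiter.m_mutex
    EventLoop loop;

    std::thread client([&] {
        // After ~ProxyClient<Thread> has run, a disconnect callback that
        // needs waiter_mutex may still be queued (or running) on the loop:
        loop.post([&] { std::lock_guard<std::mutex> lock(waiter_mutex); });

        // Proposed fix: flush the loop before this thread exits. The queue is
        // FIFO and callbacks run one at a time on the loop thread, so sync()
        // cannot return until the disconnect callback above has finished,
        // i.e. while the (in the real code thread_local) waiter mutex still
        // exists.
        loop.sync([] {});
    });
    client.join();
}
```
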
Calling `removeSyncCleanup` from the EventLoop thread
If `removeSyncCleanup(m_disconnect_cb)` could be called from the EventLoop thread, I think all previous fixes could be reverted and there would be no problem here, because `m_disconnect_cb` would only be accessed from one thread and couldn't be removed while it might be running. This would somehow have to be done without `ThreadContext::waiter.m_mutex` being held (to avoid deadlocks), so I'm not sure if this is possible.
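
Reusing the `Connection` and `EventLoop` stand-ins from the sketches above, the idea might look like this hypothetical helper; the open question noted above is whether the waiter lock can really be dropped at this point:

```cpp
#include <mutex>

// Hypothetical: unregister on the event loop thread instead of the client
// thread, so m_disconnect_cb entries are only ever touched from one thread.
void unregisterOnLoopThread(EventLoop& loop, Connection& connection,
                            Connection::CleanupIt it, std::unique_lock<std::mutex>& waiter_lock)
{
    // Must not hold ThreadContext::waiter.m_mutex while blocking on the event
    // loop, or this thread and the loop thread could deadlock on each other.
    waiter_lock.unlock();
    loop.sync([&] { connection.removeSyncCleanup(it); });
    waiter_lock.lock();
}
```
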
Switching mutexes
I am thinking that cleaner fixes for all the problems above could be possible by switching `ProxyClient<Thread>` and related code to use `EventLoop::m_mutex` for some locking instead of `ThreadContext::waiter.m_mutex`. This could simplify things by not needing to deal with two mutexes, or to release them so they can be locked in a consistent order. I was even thinking of switching to `EventLoop::m_mutex` exclusively for all locking, but I don't think I can do this, because it's possible for a single thread to make IPC calls over multiple connections, and not all connections use the same event loop.
Refactoring `ThreadContext`
The `ThreadContext` `callback_threads` and `request_threads` maps, indexed by `Connection*`, could be replaced with `Connection` maps indexed by thread id, which could allow not using `ThreadContext::waiter.m_mutex` at all and using `EventLoop::m_mutex` everywhere.
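
A sketch of the data structure change being considered, with types assumed from the description above and `std::thread::id` standing in for whatever thread identifier would actually be used:

```cpp
#include <map>
#include <thread>

struct Connection;
struct ProxyClientThread {}; // stand-in for ProxyClient<Thread>

// Current shape (roughly): per-thread state in a thread_local ThreadContext,
// keyed by connection, guarded by ThreadContext::waiter.m_mutex.
struct ThreadContextCurrent
{
    std::map<Connection*, ProxyClientThread> callback_threads;
    std::map<Connection*, ProxyClientThread> request_threads;
};

// Proposed shape: per-connection state keyed by thread id, guarded by
// EventLoop::m_mutex, so waiter.m_mutex would no longer be needed for this.
struct ConnectionThreads
{
    std::map<std::thread::id, ProxyClientThread> callback_threads;
    std::map<std::thread::id, ProxyClientThread> request_threads;
};
```
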
Summary
There is a bug and I am not sure how to fix it yet.