
Races between ~ProxyClient<Thread> and ~Connection destructors #189

@ryanofsky

#186 fixed several race conditions that happen during the "disconnecting and blocking" unit test, but a new one was reported in Sjors/bitcoin#90 (comment), with details in Sjors/bitcoin#90 (comment), and I want to write more notes about it here. There is also a branch with previously attempted fixes.

The problem is not too hard to reproduce locally: if the mptest binary is run in a loop in bash, the tsan error happens within a few minutes. With the fixes attempted so far, running the mptest binary in a loop results in a deadlock after about 20 minutes for me.

Background

If a Connection object is destroyed while a ProxyClient<Thread> object is still using it (due to a sudden disconnect), the ProxyClient<Thread>::m_disconnect_cb callback will be called from the ~Connection destructor, deleting the ProxyClient<Thread> object from whichever ConnThreads map it is in.

If a ProxyClient<Thread> object is destroyed before its Connection object (the more normal case), the ProxyClient<Thread> destructor will unregister m_disconnect_cb, so when the Connection is later destroyed it will not try to delete the ProxyClient<Thread> object.

If the two destructors are called in sequence in either order, everything is fine. But if the destructors are called from different threads in a race, numerous problems occur.
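To make the two orderings concrete, here is a minimal single-threaded sketch of the pattern, using simplified stand-in types rather than the real ProxyClient<Thread>, Connection, addSyncCleanup, and removeSyncCleanup:

```cpp
#include <functional>
#include <list>
#include <optional>

struct Connection;

// Stands in for ProxyClient<Thread>.
struct ProxyClientThread
{
    Connection& m_connection;
    // Iterator returned by addSyncCleanup; empty once the callback has been
    // consumed or unregistered.
    std::optional<std::list<std::function<void()>>::iterator> m_disconnect_cb;

    explicit ProxyClientThread(Connection& connection);
    ~ProxyClientThread();
};

struct Connection
{
    // Cleanup callbacks run from ~Connection on a sudden disconnect.
    std::list<std::function<void()>> m_sync_cleanup_fns;

    auto addSyncCleanup(std::function<void()> fn)
    {
        return m_sync_cleanup_fns.insert(m_sync_cleanup_fns.end(), std::move(fn));
    }
    void removeSyncCleanup(std::list<std::function<void()>>::iterator it)
    {
        m_sync_cleanup_fns.erase(it);
    }
    ~Connection()
    {
        // Delete any ProxyClient<Thread> objects still referencing this connection.
        while (!m_sync_cleanup_fns.empty()) {
            auto fn = std::move(m_sync_cleanup_fns.front());
            m_sync_cleanup_fns.pop_front();
            fn();
        }
    }
};

ProxyClientThread::ProxyClientThread(Connection& connection) : m_connection{connection}
{
    m_disconnect_cb = m_connection.addSyncCleanup([this] {
        m_disconnect_cb.reset(); // registration consumed by ~Connection
        delete this;             // in the real code: erased from its ConnThreads map
    });
}

ProxyClientThread::~ProxyClientThread()
{
    // Normal order: unregister so ~Connection will not later delete a dead object.
    if (m_disconnect_cb) m_connection.removeSyncCleanup(*m_disconnect_cb);
}

int main()
{
    {
        // Order 1: ProxyClient<Thread> destroyed before Connection -- fine.
        Connection connection;
        auto* client = new ProxyClientThread{connection};
        delete client;
    }
    {
        // Order 2: Connection destroyed while the client still exists -- also
        // fine, ~Connection deletes the client via the callback.
        Connection connection;
        auto* client = new ProxyClientThread{connection};
        (void)client; // deleted by ~Connection
    }
}
```

Run sequentially, either order cleans up correctly; the problems below only appear when the two destructors race on different threads.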

Problem 1: invalid m_disconnect_cb iterator erase

If the Connection is destroyed first and its destructor is about to call the m_disconnect_cb callback, but the ~ProxyClient<Thread> destructor is then called in another thread, the destructor will currently (as of a11e690) try to delete m_disconnect_cb using an iterator which is no longer valid. 79c09e4 fixes this by changing the ~ProxyClient<Thread> destructor to avoid deleting m_disconnect_cb once it has been invalidated.

Problem 2: ProxyClient<Thread> use-after-free

The second problem occurs if the Connection is destroyed first and invokes the m_disconnect_cb callback, and the ~ProxyClient<Thread> destructor runs and finishes before m_disconnect_cb has a chance to do anything. When the callback resumes, the ProxyClient<Thread> object will be gone, and the callback's attempt to access and erase it will be invalid. 79c09e4 fixes this by making ~ProxyClient<Thread> set a destroyed bit when it runs and having the m_disconnect_cb callback just return early if the bit is set.
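A rough sketch of the destroyed-bit idea (hypothetical names and types, not the actual 79c09e4 change): the destructor and the callback share a small piece of state under a mutex, so a callback that loses the race can tell the object is already gone and return early.

```cpp
#include <functional>
#include <memory>
#include <mutex>

// Hypothetical state shared between ~ProxyClient<Thread> and m_disconnect_cb.
struct DisconnectState
{
    std::mutex mutex;
    bool destroyed{false}; // set by ~ProxyClient<Thread>
};

// Registered via addSyncCleanup; invoked from ~Connection on a sudden disconnect.
std::function<void()> MakeDisconnectCallback(std::shared_ptr<DisconnectState> state)
{
    return [state] {
        std::lock_guard<std::mutex> lock{state->mutex};
        if (state->destroyed) return; // ProxyClient<Thread> already gone, nothing to erase
        // ... otherwise erase the ProxyClient<Thread> from its ConnThreads map here ...
    };
}

// Run from ~ProxyClient<Thread>.
void MarkDestroyed(DisconnectState& state)
{
    std::lock_guard<std::mutex> lock{state.mutex};
    state.destroyed = true; // any callback that checks after this returns early
}
```

The shared_ptr here is just one way to keep the flag and its mutex alive after the ProxyClient<Thread> itself is gone; in the real code the synchronization lives in the thread_local ThreadContext, which is exactly what problem 3 below is about.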

Problem 3: ThreadContext use-after-free

Just fixing the race between ~Connection and ~ProxyClient<Thread> is actually not enough, because after the ProxyClient<Thread> object is destroyed, the thread owning the ProxyClient<Thread> can exit, destroying the thread_local ThreadContext struct that contained the ProxyClient<Thread> object, and with it the ThreadContext::waiter.m_mutex mutex the ProxyClient<Thread> code uses for synchronization. If this happens while an m_disconnect_cb callback is still in progress, that callback will fail when it tries to lock the mutex and the mutex no longer exists. (Commit 85df929 was a broken attempt to fix this, and commit 30431bc has another description of the problem.)
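For reference, the lifetime relationship looks roughly like this (a simplified, hypothetical layout of ThreadContext, not the exact struct):

```cpp
#include <map>
#include <mutex>

struct Connection;
struct ProxyClientThread { /* stands in for ProxyClient<Thread> */ };

struct Waiter
{
    std::mutex m_mutex; // the ThreadContext::waiter.m_mutex referred to above
};

// Simplified: the real ThreadContext holds more than this.
struct ThreadContext
{
    Waiter waiter;
    std::map<Connection*, ProxyClientThread> request_threads;  // ProxyClient<Thread> objects live in these maps
    std::map<Connection*, ProxyClientThread> callback_threads;
};

// One per thread. When the owning thread exits, this object is destroyed,
// taking waiter.m_mutex with it -- even if an m_disconnect_cb callback on the
// event loop thread is just about to lock that mutex.
thread_local ThreadContext g_thread_context;
```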

Possible Fixes

Adding EventLoop::sync call

The most direct fix for problem 3 is probably to have the thread that destroyed the ProxyClient<Thread> object, and is about to exit and destroy ThreadContext::waiter.m_mutex, call EventLoop::sync to force m_disconnect_cb to finish running before this happens. This would work because m_disconnect_cb always runs on the event loop thread. It would require reverting 85df929 to avoid reintroducing Problem 2 (so the m_disconnect_cb callback checks the destroyed bit after it acquires ThreadContext::waiter.m_mutex instead of before).
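A sketch of what that could look like. The EventLoop below is a trivial stand-in whose sync just runs the callable inline; the only property that matters is that the real EventLoop::sync runs the callable on the event loop thread and blocks until it finishes, which is what makes it act as a barrier for m_disconnect_cb:

```cpp
// Trivial stand-in: the real EventLoop::sync dispatches the callable to the
// event loop thread and waits for it to complete there.
struct EventLoop
{
    template <typename Fn>
    void sync(Fn&& fn) { fn(); }
};

// Hypothetical hook run on the client thread after ~ProxyClient<Thread>, just
// before the thread exits and its thread_local ThreadContext (including
// waiter.m_mutex) is destroyed.
void WaitForDisconnectCallbacks(EventLoop& loop)
{
    // Because m_disconnect_cb always runs on the event loop thread, once this
    // no-op has run there, any disconnect callback queued or running ahead of
    // it must have finished, so the thread can safely exit and let
    // waiter.m_mutex be destroyed.
    loop.sync([] {});
}
```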

Calling removeSyncCleanup from EventLoop thread

If removeSyncCleanup(m_disconnect_cb) could be called from the EventLoop thread, I think all the previous fixes could be reverted and there would be no problem here, because m_disconnect_cb would only be accessed from one thread and couldn't be removed while it might be running. This would somehow have to be done without ThreadContext::waiter.m_mutex being held (to avoid deadlocks), so I'm not sure if this is possible.
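A sketch of the idea, reusing the same trivial EventLoop stand-in and a stripped-down Connection (the helper name is hypothetical, and whether it can be reached without ThreadContext::waiter.m_mutex held is the open question above):

```cpp
#include <functional>
#include <list>

// Trivial stand-in: the real EventLoop::sync runs the callable on the event
// loop thread and blocks until it completes.
struct EventLoop
{
    template <typename Fn>
    void sync(Fn&& fn) { fn(); }
};

struct Connection
{
    using CleanupIt = std::list<std::function<void()>>::iterator;
    std::list<std::function<void()>> m_sync_cleanup_fns;
    void removeSyncCleanup(CleanupIt it) { m_sync_cleanup_fns.erase(it); }
};

// Hypothetical: called from ~ProxyClient<Thread> instead of calling
// removeSyncCleanup directly. The removal then happens on the event loop
// thread, serialized with ~Connection invoking the callbacks, so
// m_disconnect_cb is only ever added, run, and removed on that one thread.
void RemoveDisconnectCbOnEventLoop(EventLoop& loop, Connection& connection, Connection::CleanupIt it)
{
    loop.sync([&] { connection.removeSyncCleanup(it); });
}
```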

Switching mutexes

I am thinking that cleaner fixes for all the problems above could be possible by switching ProxyClient<Thread> and related code to use EventLoop::m_mutex for some locking instead of ThreadContext::waiter.m_mutex. This could simplify things by avoiding the need to deal with two mutexes and to release and reacquire them to keep a consistent lock order. I was even thinking of switching to EventLoop::m_mutex exclusively for all locking, but I don't think I can do that, because it's possible for a single thread to make IPC calls over multiple connections, and not all connections necessarily use the same event loop.

Refactoring ThreadContext

The ThreadContext callback_threads and request_threads maps, which are indexed by Connection*, could be replaced with per-Connection maps indexed by thread id. That could allow not using ThreadContext::waiter.m_mutex at all and using EventLoop::m_mutex everywhere.
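Roughly, the data-structure change would look like this (simplified, hypothetical layouts on both sides):

```cpp
#include <map>

struct Connection;
struct ProxyClientThread { /* stands in for ProxyClient<Thread> */ };

// Current shape (simplified): per-thread state keyed by connection, guarded
// by ThreadContext::waiter.m_mutex.
struct ThreadContextCurrent
{
    std::map<Connection*, ProxyClientThread> request_threads;
    std::map<Connection*, ProxyClientThread> callback_threads;
};

// Proposed shape (simplified): per-connection state keyed by thread id, which
// could be guarded by the connection's EventLoop::m_mutex instead, removing
// the need for ThreadContext::waiter.m_mutex here.
using ThreadId = int; // hypothetical; the real key would be whatever identifies a thread

struct ConnectionThreads
{
    std::map<ThreadId, ProxyClientThread> request_threads;
    std::map<ThreadId, ProxyClientThread> callback_threads;
};
```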

Summary

There is a bug and I am not sure how to fix it yet.
