Intermittent deadlock when Closing a tree #1017

Open
@mark-rushakoff

Description

I frequently hit this deadlock in tests that run multiple SDK apps in the same process, on iavl tag v1.3.2. I am working against the latest cosmos-sdk commit, where async pruning is hard-coded to true.

The relevant code snippets are:

iavl/nodedb.go

Lines 1122 to 1129 in d89d5d2

func (ndb *nodeDB) Close() error {
	ndb.mtx.Lock()
	defer ndb.mtx.Unlock()
	ndb.cancel()
	if ndb.opts.AsyncPruning {
		<-ndb.done // wait for the pruning process to finish
	}

and

iavl/nodedb.go

Lines 599 to 608 in d89d5d2

func (ndb *nodeDB) startPruning() {
	for {
		select {
		case <-ndb.ctx.Done():
			ndb.done <- struct{}{}
			return
		default:
			ndb.mtx.Lock()
			toVersion := ndb.pruneVersion
			ndb.mtx.Unlock()

(*nodeDB).startPruning runs in its own goroutine, created during newNodeDB. (*nodeDB).Close is called on a separate goroutine, e.g. from closing an SDK commitment store. The deadlock unfolds as follows:

  1. The Close goroutine acquires the lock on ndb.mtx.
  2. Concurrently, the startPruning goroutine enters the default case and calls ndb.mtx.Lock(), which blocks until the Close goroutine releases the lock.
  3. The Close goroutine calls ndb.cancel() and then blocks reading from ndb.done while still holding ndb.mtx. The startPruning goroutine can never get past ndb.mtx.Lock() to observe ndb.ctx.Done() and send on ndb.done, so neither goroutine makes progress.
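
The three steps above can be modeled in isolation. The following is a minimal sketch, not iavl code: the pruner type, newPruner, closeHoldingLock, closeWithoutLock, closesInTime, and closeTimeout are all hypothetical names, and the one-millisecond sleep is only a stand-in for real pruning work. closeHoldingLock mirrors the pattern quoted from nodedb.go, while closeWithoutLock shows one possible way to avoid the deadlock by waiting on the done channel before taking the mutex. It is not necessarily the change that should be made upstream.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// closeTimeout bounds how long closesInTime waits (hypothetical value).
const closeTimeout = 2 * time.Second

// pruner is a hypothetical stand-in for nodeDB, reduced to the fields
// that take part in the deadlock.
type pruner struct {
	mtx          sync.Mutex
	ctx          context.Context
	cancel       context.CancelFunc
	done         chan struct{}
	pruneVersion int64
}

// newPruner starts the background loop, as newNodeDB is described to do.
func newPruner() *pruner {
	ctx, cancel := context.WithCancel(context.Background())
	p := &pruner{ctx: ctx, cancel: cancel, done: make(chan struct{})}
	go p.startPruning()
	return p
}

// startPruning has the same shape as the loop quoted in the issue.
func (p *pruner) startPruning() {
	for {
		select {
		case <-p.ctx.Done():
			p.done <- struct{}{}
			return
		default:
			p.mtx.Lock()
			_ = p.pruneVersion // read shared state under the lock
			p.mtx.Unlock()
			time.Sleep(time.Millisecond) // stand-in for pruning work
		}
	}
}

// closeHoldingLock mirrors the reported pattern: it waits on done while
// holding mtx, so it can hang if the loop is blocked in mtx.Lock().
func (p *pruner) closeHoldingLock() {
	p.mtx.Lock()
	defer p.mtx.Unlock()
	p.cancel()
	<-p.done
}

// closeWithoutLock is one possible fix: cancel and wait first, and only
// then take the mutex for any remaining cleanup.
func (p *pruner) closeWithoutLock() {
	p.cancel()
	<-p.done // the loop can still take mtx, so it always reaches ctx.Done()
	p.mtx.Lock()
	defer p.mtx.Unlock()
	// Remaining cleanup that needs the lock would go here.
}

// closesInTime reports whether closeFn returns before closeTimeout.
func closesInTime(closeFn func()) bool {
	finished := make(chan struct{})
	go func() {
		closeFn()
		close(finished)
	}()
	select {
	case <-finished:
		return true
	case <-time.After(closeTimeout):
		return false
	}
}

func main() {
	ok := 0
	for i := 0; i < 100; i++ {
		if closesInTime(newPruner().closeWithoutLock) {
			ok++
		}
	}
	fmt.Printf("closeWithoutLock returned in time for %d of 100 pruners\n", ok)
}
```

Passing closeHoldingLock to closesInTime instead exercises the reported pattern. Because the hang depends on goroutine scheduling, it may or may not time out on any given run, which matches the intermittent behavior described above.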
