
Increase connection timeout to avoid index sync error #362


Open · wants to merge 1 commit into base: unstable

Conversation

@dknopik (Member) commented Jun 5, 2025

There is only one connection in our connection pool, which is fine. While the node runs, we only write, and when we write, the database locks anyway.

However, as processing batches might take a while, the index syncer cannot always get a connection within 5 seconds during index sync. Increasing the timeout to 60 seconds improves the situation.

Note that these failures do not cause data corruption, as the affected validators would simply be reattempted later, so it is fine if the timeout is still insufficient in edge cases.
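
For illustration, here is a minimal sketch of how such a single-connection pool with a 60-second acquisition timeout might be configured, assuming an r2d2-style pool over rusqlite (the r2d2 and r2d2_sqlite crates); build_pool and its path argument are illustrative and may not match Anchor's actual database setup.

use std::time::Duration;

use r2d2_sqlite::SqliteConnectionManager;

const POOL_SIZE: u32 = 1;
const CONNECTION_TIMEOUT: Duration = Duration::from_secs(60);

// Single-connection pool: callers block for up to CONNECTION_TIMEOUT
// waiting for the one connection before `get()` returns an error.
fn build_pool(path: &str) -> Result<r2d2::Pool<SqliteConnectionManager>, r2d2::Error> {
    let manager = SqliteConnectionManager::file(path);
    r2d2::Pool::builder()
        .max_size(POOL_SIZE)
        .connection_timeout(CONNECTION_TIMEOUT)
        .build(manager)
}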

@diegomrsantos requested a review from Copilot on June 9, 2025, 12:16
Copilot left a comment that was later marked as outdated.

@diegomrsantos (Contributor) commented:

Are we sure that only one connection is enough? Increasing from 5 s to 60 s is a lot; would a smaller number work?

@diegomrsantos requested a review from Copilot on June 23, 2025, 19:26
@Copilot (Copilot AI) left a comment

Pull Request Overview

This PR increases the database connection acquisition timeout to reduce index sync failures under lock contention.

  • Increased CONNECTION_TIMEOUT from 5s to 60s
  • Addresses intermittent “cannot get connection” errors during batch processing
Comments suppressed due to low confidence (1)

anchor/database/src/lib.rs:38

  • [nitpick] It may help to add a comment explaining why the pool size is set to 1 and how it relates to connection locking behavior during writes.
const POOL_SIZE: u32 = 1;

@@ -36,7 +36,7 @@ mod validator_operations;
 mod tests;

 const POOL_SIZE: u32 = 1;
-const CONNECTION_TIMEOUT: Duration = Duration::from_secs(5);
+const CONNECTION_TIMEOUT: Duration = Duration::from_secs(60);
Copilot AI commented Jun 23, 2025

[nitpick] Consider making this timeout configurable (e.g., via environment variable or config file) instead of a hard-coded constant, so future adjustments don’t require a code change.
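
As a rough sketch of what such an override could look like, assuming an environment variable is acceptable; the name ANCHOR_DB_CONNECTION_TIMEOUT_SECS is purely illustrative and not an existing Anchor option.

use std::time::Duration;

const DEFAULT_CONNECTION_TIMEOUT: Duration = Duration::from_secs(60);

// Hypothetical override: fall back to the compiled-in default when the
// variable is unset or does not parse as a whole number of seconds.
fn connection_timeout() -> Duration {
    std::env::var("ANCHOR_DB_CONNECTION_TIMEOUT_SECS")
        .ok()
        .and_then(|s| s.parse::<u64>().ok())
        .map(Duration::from_secs)
        .unwrap_or(DEFAULT_CONNECTION_TIMEOUT)
}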


@dknopik (Member, Author) commented Jun 24, 2025

Are we sure that only one connection is enough?

Allowing two writing connections (e.g. syncer and index syncer) would cause "database is locked" errors and require a retry mechanism in our logic. Having a pool with one connection and waiting for it simplifies the logic.

Increasing from 5 s to 60 s is a lot; would a smaller number work?

On my machine, 5 s was too short, as the later batches on mainnet take too long to process. So the index syncer would time out waiting for the connection.

Note that only these writers block; for reading, we do not access the SQL database.
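
To illustrate this design, here is a sketch of what waiting on the single connection and deferring to a later retry might look like, assuming the r2d2-style pool sketched above; sync_validator_indices and the use of tracing for logging are assumptions, not Anchor's actual index sync code.

use r2d2_sqlite::SqliteConnectionManager;

// Block for up to the pool's connection timeout waiting for the single
// connection; on timeout, log and let a later sync round retry the
// affected validators.
fn sync_validator_indices(pool: &r2d2::Pool<SqliteConnectionManager>) {
    match pool.get() {
        Ok(conn) => {
            // ... write the freshly fetched validator indices using `conn` ...
            let _ = conn;
        }
        Err(err) => {
            tracing::warn!("index sync could not get a database connection: {}", err);
        }
    }
}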

@diegomrsantos (Contributor) commented:

Allowing two writing connections (e.g. syncer and index syncer) would cause "database is locked" errors and require a retry mechanism in our logic.

Why? Having multiple writing txs is typical in most systems.

@diegomrsantos (Contributor) commented:

Btw what's an index syncer?

@dknopik (Member, Author) commented Jun 24, 2025

Why? Having multiple writing txs is typical in most systems.

Yeah, but these txs still cause locking in those systems.

In SQLite, SQLITE_BUSY is returned immediately by default instead of waiting for the lock to be released. We could use PRAGMA busy_timeout to change this behaviour and have the connections wait for the lock instead. Note that we would then need to change #358 to allow multiple connections again, and separately solve preventing multiple Anchor instances from working on the same data dir.
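
A minimal sketch of setting that pragma with rusqlite, assuming a plain rusqlite Connection is available (Anchor's actual setup may wrap this differently); open_with_busy_timeout is an illustrative name.

use std::time::Duration;

use rusqlite::Connection;

fn open_with_busy_timeout(path: &str) -> rusqlite::Result<Connection> {
    let conn = Connection::open(path)?;
    // Equivalent to `PRAGMA busy_timeout = 60000`: on a locked database,
    // SQLite retries for up to 60 s before returning SQLITE_BUSY.
    conn.busy_timeout(Duration::from_secs(60))?;
    Ok(conn)
}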

@dknopik (Member, Author) commented Jun 24, 2025

Btw what's an index syncer?

https://github.com/sigp/anchor/blob/b724c9829b18e07db3710797c52b19d88e5e88e4/anchor/eth/src/index_sync.rs

It is our component that fetches the validator indices.

@diegomrsantos (Contributor) commented:

Yeah, but these txs still cause locking in those systems.

But more advanced DBMSs don't lock the whole database.

@diegomrsantos (Contributor) commented:

to allow multiple connections again, and separately solve preventing multiple Anchor instances from working on the same data dir.

Is preventing multiple connections the right way of solving this? Is there a way to see that the database is already in use when trying to use it again?

@dknopik (Member, Author) commented Jun 25, 2025

Is preventing multiple connections the right way of solving this?

It is certainly the easiest way, especially since we do not really gain parallelism with multiple connections as we have no readers, just writers (which lock the DB).

@diegomrsantos (Contributor) commented:

Is preventing multiple connections the right way of solving this?

It is certainly the easiest way, especially since we do not really gain parallelism with multiple connections as we have no readers, just writers (which lock the DB).

Tbh, only one writer that locks the whole db doesn't feel right to me.

@dknopik (Member, Author) commented Jun 25, 2025

Tbh, only one writer that locks the whole db doesn't feel right to me.

Even if SQLite supported table-level locking, it would not solve our problem, as the sync AND the index sync frequently need the validator table.

@diegomrsantos (Contributor) commented:

Tbh, only one writer that locks the whole db doesn't feel right to me.

Even if SQLite supported table-level locking, it would not solve our problem, as the sync AND the index sync frequently need the validator table.

I don't think modern DBMSs lock the whole table either.

@dknopik (Member, Author) commented Jun 25, 2025

I don't think modern DBMSs lock the whole table either.

Some DBMSs support row-level locking, but that still does not solve the problem, as the syncer might remove a validator while the index syncer tries to update the index. It is also IMO not a good idea to switch to another DBMS now.

@diegomrsantos (Contributor) commented:

I don't think modern DBMSs lock the whole table either.

Some DBMSs support row-level locking, but that still does not solve the problem, as the syncer might remove a validator while the index syncer tries to update the index. It is also IMO not a good idea to switch to another DBMS now.

That's not the point; rather, contention on one row is smaller than on the whole table or database. That's what DashMap does.

@diegomrsantos (Contributor) commented:

We could use PRAGMA busy_timeout to change this behaviour and have the connections wait for the lock instead.

It could be the best option for now. How long could it take in the worst-case scenario? Is it acceptable for our system?

@dknopik (Member, Author) commented Jun 25, 2025

How long could it take in the worst-case scenario?

The true worst case depends on the user's machine. I think 60 s should suffice in reasonable cases; if it does not, the index syncer will error and try again later.

Is it acceptable for our system?

Yes, the index syncer can wait :)

Do you have a plan for how we prevent multiple Anchor processes from connecting?

@diegomrsantos (Contributor) commented Jun 25, 2025

That's an AI suggestion:

Recommended pattern: a tiny advisory lock file

  1. Create a sentinel (e.g. $DATA_DIR/.anchor.lock) at start-up.
  2. Take an exclusive advisory lock (flock) on it and keep the file descriptor open for the whole lifetime of the process.
  3. If the lock cannot be obtained immediately, print a clear error and exit.

The lock takes a few bytes, is inherited by child threads, and disappears automatically on crash/SIGKILL when the FD gets closed by the kernel, so there is no manual clean-up race.

Minimal Rust implementation (cross-platform)

use std::{
    fs::{File, OpenOptions},
    path::Path,
};

use anyhow::{Context, Result};
use fd_lock::{RwLock, RwLockWriteGuard}; // "fd-lock" crate, uses flock() or Win32 byte-range locks

/// Returns a guard that is held for the rest of the process and releases
/// automatically when the file descriptor is closed (drop, crash, SIGKILL).
pub fn acquire_data_dir_lock(data_dir: &Path) -> Result<RwLockWriteGuard<'static, File>> {
    let path = data_dir.join(".anchor.lock");

    let file = OpenOptions::new()
        .create(true)
        .read(true)
        .write(true) // needed for an exclusive lock on Windows
        .open(&path)
        .with_context(|| format!("cannot create lock file {}", path.display()))?;

    // Leak the lock so the guard can live for the whole lifetime of the process.
    let lock: &'static mut RwLock<File> = Box::leak(Box::new(RwLock::new(file)));

    // try_write() is non-blocking – ideal for “fail fast”.
    lock.try_write().with_context(|| {
        format!(
            "Another Anchor instance already uses {}; aborting.",
            data_dir.display()
        )
    })
}

Add one line in main before you spawn any threads:

let _data_dir_guard = acquire_data_dir_lock(&config.data_dir)?;

@diegomrsantos (Contributor) commented:

Btw, the problem isn't only the db, but all files used by the system (like logs), isn't it?

@dknopik (Member, Author) commented Jun 25, 2025

Btw, the problem isn't only the db, but all files used by the system (like logs), isn't it?

Yeah. Locking the database would also (effectively) protect them, as we open the db early and crash if we can't, except for a bit of logging before we try to open the db.
