
Increase connection timeout to avoid index sync error #362


Open · wants to merge 1 commit into base: unstable

Conversation

@dknopik (Member) commented Jun 5, 2025

There is only one connection in our connection pool, which is fine. While the node runs, we only write, and when we write, the database locks anyway.

However, as processing batches might take a while, the index syncer cannot always get a connection within 5 seconds during index sync. Increasing the timeout to 60 seconds improves the situation.

Note that these failures do not cause data corruption, as the affected validators would simply be reattempted later, so it is fine if the timeout is still insufficient in edge cases.
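
For illustration, here is a minimal sketch of how such a single-connection pool with a 60-second acquisition timeout might be configured, assuming an r2d2-style pool over rusqlite (the r2d2 and r2d2_sqlite crates); build_pool and its path argument are illustrative and may not match Anchor's actual database setup.

use std::time::Duration;

use r2d2_sqlite::SqliteConnectionManager;

const POOL_SIZE: u32 = 1;
const CONNECTION_TIMEOUT: Duration = Duration::from_secs(60);

// Single-connection pool: callers block for up to CONNECTION_TIMEOUT
// waiting for the one connection before `get()` returns an error.
fn build_pool(path: &str) -> Result<r2d2::Pool<SqliteConnectionManager>, r2d2::Error> {
    let manager = SqliteConnectionManager::file(path);
    r2d2::Pool::builder()
        .max_size(POOL_SIZE)
        .connection_timeout(CONNECTION_TIMEOUT)
        .build(manager)
}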

@diegomrsantos requested a review from Copilot on June 9, 2025, 12:16
Copilot left a comment that was later marked as outdated.

@diegomrsantos (Contributor) commented:

Are we sure that only one connection is enough? Increasing from 5 s to 60 s is a lot; would a smaller number work?

@diegomrsantos requested a review from Copilot on June 23, 2025, 19:26
@Copilot (Copilot AI) left a comment

Pull Request Overview

This PR increases the database connection acquisition timeout to reduce index sync failures under lock contention.

  • Increased CONNECTION_TIMEOUT from 5s to 60s
  • Addresses intermittent “cannot get connection” errors during batch processing
Comments suppressed due to low confidence (1)

anchor/database/src/lib.rs:38

  • [nitpick] It may help to add a comment explaining why the pool size is set to 1 and how it relates to connection locking behavior during writes.
const POOL_SIZE: u32 = 1;

@@ -36,7 +36,7 @@ mod validator_operations;
 mod tests;

 const POOL_SIZE: u32 = 1;
-const CONNECTION_TIMEOUT: Duration = Duration::from_secs(5);
+const CONNECTION_TIMEOUT: Duration = Duration::from_secs(60);
Copilot AI commented Jun 23, 2025

[nitpick] Consider making this timeout configurable (e.g., via environment variable or config file) instead of a hard-coded constant, so future adjustments don’t require a code change.
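
As a rough sketch of what such an override could look like, assuming an environment variable is acceptable; the name ANCHOR_DB_CONNECTION_TIMEOUT_SECS is purely illustrative and not an existing Anchor option.

use std::time::Duration;

const DEFAULT_CONNECTION_TIMEOUT: Duration = Duration::from_secs(60);

// Hypothetical override: fall back to the compiled-in default when the
// variable is unset or does not parse as a whole number of seconds.
fn connection_timeout() -> Duration {
    std::env::var("ANCHOR_DB_CONNECTION_TIMEOUT_SECS")
        .ok()
        .and_then(|s| s.parse::<u64>().ok())
        .map(Duration::from_secs)
        .unwrap_or(DEFAULT_CONNECTION_TIMEOUT)
}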


@dknopik (Member, Author) commented Jun 24, 2025

Are we sure that only one connection is enough?

Allowing two writing connections (e.g. syncer and index syncer) would cause "database is locked" errors and require a retry mechanism in our logic. Having a pool with one connection and waiting for it simplifies the logic.

Increasing from 5 s to 60 s is a lot; would a smaller number work?

On my machine, 5 s was too short, as the later batches on mainnet take too long to process. So the index syncer would time out waiting for the connection.

Note that only these writers block; for reading, we do not access the SQL database.
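
To illustrate this design, here is a sketch of what waiting on the single connection and deferring to a later retry might look like, assuming the r2d2-style pool sketched above; sync_validator_indices and the use of tracing for logging are assumptions, not Anchor's actual index sync code.

use r2d2_sqlite::SqliteConnectionManager;

// Block for up to the pool's connection timeout waiting for the single
// connection; on timeout, log and let a later sync round retry the
// affected validators.
fn sync_validator_indices(pool: &r2d2::Pool<SqliteConnectionManager>) {
    match pool.get() {
        Ok(conn) => {
            // ... write the freshly fetched validator indices using `conn` ...
            let _ = conn;
        }
        Err(err) => {
            tracing::warn!("index sync could not get a database connection: {}", err);
        }
    }
}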

@diegomrsantos (Contributor) commented:

Allowing two writing connections (e.g. syncer and index syncer) would cause "database is locked" errors and require a retry mechanism in our logic.

Why? Having multiple writing txs is typical in most systems.

@diegomrsantos (Contributor) commented:

Btw what's an index syncer?

@dknopik (Member, Author) commented Jun 24, 2025

Why? Having multiple writing txs is typical in most systems.

Yeah, but these txs still cause locking in those systems.

In SQLite, SQLITE_BUSY is returned immediately by default instead of waiting for the lock to be released. We could use PRAGMA busy_timeout to change this behaviour and have the connections wait for the lock instead. Note that we would then need to change #358 to allow multiple connections again, and separately solve preventing multiple Anchor instances from working on the same data dir.
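
A minimal sketch of setting that pragma with rusqlite, assuming a plain rusqlite Connection is available (Anchor's actual setup may wrap this differently); open_with_busy_timeout is an illustrative name.

use std::time::Duration;

use rusqlite::Connection;

fn open_with_busy_timeout(path: &str) -> rusqlite::Result<Connection> {
    let conn = Connection::open(path)?;
    // Equivalent to `PRAGMA busy_timeout = 60000`: on a locked database,
    // SQLite retries for up to 60 s before returning SQLITE_BUSY.
    conn.busy_timeout(Duration::from_secs(60))?;
    Ok(conn)
}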

@dknopik (Member, Author) commented Jun 24, 2025

Btw what's an index syncer?

https://github.com/sigp/anchor/blob/b724c9829b18e07db3710797c52b19d88e5e88e4/anchor/eth/src/index_sync.rs

It is our component that fetches the validator indices.

@diegomrsantos (Contributor) commented:

Yeah, but these txs still cause locking in those systems.

But more advanced DBMSs don't lock the whole database.

@diegomrsantos (Contributor) commented:

to allow multiple connections again, and separately solve preventing multiple Anchor instances from working on the same data dir.

Is preventing multiple connections the right way of solving this? Is there a way to see that the database is already in use when trying to use it again?

@dknopik (Member, Author) commented Jun 25, 2025

Is preventing multiple connections the right way of solving this?

It is certainly the easiest way, especially since we do not really gain parallelism with multiple connections as we have no readers, just writers (which lock the DB).

@diegomrsantos (Contributor) commented:

Is preventing multiple connections the right way of solving this?

It is certainly the easiest way, especially since we do not really gain parallelism with multiple connections as we have no readers, just writers (which lock the DB).

Tbh, only one writer that locks the whole db doesn't feel right to me.

@dknopik (Member, Author) commented Jun 25, 2025

Tbh, only one writer that locks the whole db doesn't feel right to me.

Even if SQLite supported table-level locking, it would not solve our problem, as the sync AND the index sync frequently need the validator table.

@diegomrsantos (Contributor) commented:

Tbh, only one writer that locks the whole db doesn't feel right to me.

Even if SQLite supported table-level locking, it would not solve our problem, as the sync AND the index sync frequently need the validator table.

I don't think modern DBMSs lock the whole table either.

@dknopik (Member, Author) commented Jun 25, 2025

I don't think modern DBMSs lock the whole table either.

Some DBMSs support row-level locking, but that still does not solve the problem, as the syncer might remove a validator while the index syncer tries to update the index. It is also IMO not a good idea to switch to another DBMS now.

@diegomrsantos (Contributor) commented:

I don't think modern DBMSs lock the whole table either.

Some DBMSs support row-level locking, but that still does not solve the problem, as the syncer might remove a validator while the index syncer tries to update the index. It is also IMO not a good idea to switch to another DBMS now.

That's not the point; rather, contention on one row is smaller than on the whole table or database. That's what DashMap does.

@diegomrsantos (Contributor) commented:

We could use PRAGMA busy_timeout to change this behaviour and have the connections wait for the lock instead.

It could be the best option for now. How long could it take in the worst-case scenario? Is it acceptable for our system?

@dknopik (Member, Author) commented Jun 25, 2025

How long could it take in the worst-case scenario?

The true worst case depends on the user's machine. I think 60 s should suffice in reasonable cases; if it does not, the index syncer will error and try again later.

Is it acceptable for our system?

Yes, the index syncer can wait :)

Do you have a plan for how we prevent multiple Anchor processes from connecting?

@diegomrsantos (Contributor) commented Jun 25, 2025

That's an AI suggestion:

Recommended pattern: a tiny advisory lock file

  1. Create a sentinel (e.g. $DATA_DIR/.anchor.lock) at start-up.
  2. Take an exclusive advisory lock (flock) on it and keep the file descriptor open for the whole lifetime of the process.
  3. If the lock cannot be obtained immediately, print a clear error and exit.

The lock takes a few bytes, is inherited by child threads, and disappears automatically on crash/SIGKILL when the FD gets closed by the kernel, so there is no manual clean-up race.

Minimal Rust implementation (cross-platform)

use std::{
    fs::{File, OpenOptions},
    path::Path,
};

use anyhow::{Context, Result};
use fd_lock::{RwLock, RwLockWriteGuard}; // "fd-lock" crate, uses flock() or Win32 byte-range locks

/// Returns a guard that is held for the rest of the process and releases
/// automatically when the file descriptor is closed (drop, crash, SIGKILL).
pub fn acquire_data_dir_lock(data_dir: &Path) -> Result<RwLockWriteGuard<'static, File>> {
    let path = data_dir.join(".anchor.lock");

    let file = OpenOptions::new()
        .create(true)
        .read(true)
        .write(true) // needed for an exclusive lock on Windows
        .open(&path)
        .with_context(|| format!("cannot create lock file {}", path.display()))?;

    // Leak the lock so the guard can live for the whole lifetime of the process.
    let lock: &'static mut RwLock<File> = Box::leak(Box::new(RwLock::new(file)));

    // try_write() is non-blocking – ideal for “fail fast”.
    lock.try_write().with_context(|| {
        format!(
            "Another Anchor instance already uses {}; aborting.",
            data_dir.display()
        )
    })
}

Add one line in main before you spawn any threads:

let _data_dir_guard = acquire_data_dir_lock(&config.data_dir)?;

@diegomrsantos (Contributor) commented:

Btw, the problem isn't only the db, but all files used by the system (like logs), isn't it?

@dknopik (Member, Author) commented Jun 25, 2025

Btw, the problem isn't only the db, but all files used by the system (like logs), isn't it?

Yeah. Locking the database would also (effectively) protect them, as we open the db early and crash if we can't, except for a bit of logging before we try to open the db.
