Increase connection timeout to avoid index sync error #362
base: unstable
Conversation
Are we sure that only one connection is enough? Increasing from 5 s to 60 s is a lot; would a smaller number work?
Pull Request Overview
This PR increases the database connection acquisition timeout to reduce index sync failures under lock contention.
- Increased CONNECTION_TIMEOUT from 5s to 60s
- Addresses intermittent “cannot get connection” errors during batch processing
Comments suppressed due to low confidence (1)
anchor/database/src/lib.rs:38
- [nitpick] It may help to add a comment explaining why the pool size is set to 1 and how it relates to connection locking behavior during writes.
const POOL_SIZE: u32 = 1;
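A possible shape for such a comment (illustrative wording only, not part of this PR):

// Illustrative suggestion - not part of the diff.
// SQLite allows a single writer at a time, so a pool of one connection makes
// concurrent writers (syncer, index syncer) queue on `pool.get()` instead of
// failing with "database is locked".
const POOL_SIZE: u32 = 1;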
@@ -36,7 +36,7 @@ mod validator_operations;
 mod tests;

 const POOL_SIZE: u32 = 1;
-const CONNECTION_TIMEOUT: Duration = Duration::from_secs(5);
+const CONNECTION_TIMEOUT: Duration = Duration::from_secs(60);
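For context, a minimal sketch of how these two constants would typically plug into an r2d2 + r2d2_sqlite pool; this is not the project's actual wiring, and the database file name is a placeholder:

use std::time::Duration;
use r2d2_sqlite::SqliteConnectionManager;

const POOL_SIZE: u32 = 1;
const CONNECTION_TIMEOUT: Duration = Duration::from_secs(60);

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Single-connection pool: concurrent writers queue on `pool.get()` for up
    // to CONNECTION_TIMEOUT instead of racing for SQLite's write lock.
    let manager = SqliteConnectionManager::file("anchor.sqlite");
    let pool = r2d2::Pool::builder()
        .max_size(POOL_SIZE)
        .connection_timeout(CONNECTION_TIMEOUT)
        .build(manager)?;

    // Blocks for up to 60 s waiting for the single connection, then errors.
    let conn = pool.get()?;
    conn.execute_batch("CREATE TABLE IF NOT EXISTS validators (id INTEGER PRIMARY KEY);")?;
    Ok(())
}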
[nitpick] Consider making this timeout configurable (e.g., via environment variable or config file) instead of a hard-coded constant, so future adjustments don’t require a code change.
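A minimal sketch of that suggestion, assuming an environment-variable override; the variable name ANCHOR_DB_CONNECTION_TIMEOUT_SECS is hypothetical:

use std::time::Duration;

const DEFAULT_CONNECTION_TIMEOUT: Duration = Duration::from_secs(60);

/// Hypothetical: allow overriding the timeout via an environment variable,
/// falling back to the compiled-in 60 s default.
fn connection_timeout() -> Duration {
    std::env::var("ANCHOR_DB_CONNECTION_TIMEOUT_SECS")
        .ok()
        .and_then(|secs| secs.parse::<u64>().ok())
        .map(Duration::from_secs)
        .unwrap_or(DEFAULT_CONNECTION_TIMEOUT)
}

fn main() {
    println!("connection timeout: {:?}", connection_timeout());
}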
Allowing two writing connections (e.g. syncer and index syncer) would cause "database is locked" errors and require a retry mechanism in our logic. Having a pool with one connection and waiting for it simplifies the logic.
On my machine, 5 s was too short because the later batches on mainnet take too long to process, so the index syncer would time out waiting for the connection. Note that only these writers block each other - for reading, we do not access the SQL database.
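For illustration, a standalone rusqlite example (not project code; the demo file name is a placeholder) of why a second write connection hits "database is locked":

use rusqlite::{Connection, TransactionBehavior};

fn main() -> rusqlite::Result<()> {
    // Two independent write connections to the same database file.
    let mut a = Connection::open("demo.sqlite")?;
    let mut b = Connection::open("demo.sqlite")?;
    a.execute_batch("CREATE TABLE IF NOT EXISTS t (x INTEGER);")?;

    // First writer takes SQLite's single write lock.
    let tx_a = a.transaction_with_behavior(TransactionBehavior::Immediate)?;
    tx_a.execute("INSERT INTO t (x) VALUES (1)", [])?;

    // While tx_a is open, a second writer cannot start its own write
    // transaction and fails immediately (no busy timeout is set).
    match b.transaction_with_behavior(TransactionBehavior::Immediate) {
        Err(e) => eprintln!("second writer rejected: {e}"),
        Ok(_) => eprintln!("unexpected: second writer got the lock"),
    }

    tx_a.commit()
}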
Why? Having multiple writing txs is typical in most systems.
Btw, what's an index syncer?
Yeah, but these txs still cause locking in those systems. In SQLite, a write transaction locks the whole database.
It is our component that fetches the validator indices.
But more advanced DBMSs don't lock the whole database.
Is preventing multiple connections the right way of solving this? Is there a way to see that the database is already in use when trying to use it again?
It is certainly the easiest way, especially since we do not really gain parallelism with multiple connections, as we have no readers, just writers (which lock the DB).
Tbh, only one write connection that locks the whole db doesn't feel right to me.
Even if SQLite supported table-level locking, it would not solve our problem, as the sync AND the index sync frequently need the same table.
I don't think modern DBMSs lock the whole table either.
Some DBMSs support row-level locking - but that still does not solve the problem, as the syncer might remove a validator while the index syncer tries to update its index. It is also IMO not a good idea to switch to another DBMS now.
That's not the point, but rather that contention on one row is smaller than on the whole table or database. That's what DashMap does.
It could be the best option for now. How long could it take in the worst-case scenario? Is it acceptable for our system?
The true worst case depends on the user's machine. I think 60 s should suffice in reasonable cases; if it does not, the index syncer will error and try again later.
Yes, the index syncer can wait :) Do you have a plan for how we prevent multiple Anchor processes from connecting?
That's an AI suggestion:
Recommended pattern: a tiny advisory lock file
The lock takes a few bytes, is inherited by child threads, and disappears automatically on crash/SIGKILL when the FD gets closed by the kernel, so there is no manual clean-up race.
Minimal Rust implementation (cross-platform):

use std::{
    fs::OpenOptions,
    path::Path,
};
use anyhow::{Context, Result};
use fd_lock::RwLock; // "fd-lock" crate, uses flock() or Win32 byte-range locks

/// Returns a guard that is released automatically on drop.
/// The RwLock is leaked so the guard can be held for the lifetime of the program.
pub fn acquire_data_dir_lock(
    data_dir: &Path,
) -> Result<fd_lock::RwLockWriteGuard<'static, std::fs::File>> {
    let path = data_dir.join(".anchor.lock");
    let file = OpenOptions::new()
        .create(true)
        .read(true)
        .write(true) // needed for an exclusive lock on Windows
        .open(&path)
        .with_context(|| format!("cannot create lock file {}", path.display()))?;
    let lock = Box::leak(Box::new(RwLock::new(file)));
    // try_write() is non-blocking - ideal for "fail fast".
    match lock.try_write() {
        Ok(guard) => Ok(guard), // keep this guard alive for the whole program
        Err(_) => anyhow::bail!(
            "Another Anchor instance already uses {}; aborting.",
            data_dir.display()
        ),
    }
}

Add one line in main before you spawn any threads:

let _data_dir_guard = acquire_data_dir_lock(&config.data_dir)?;
Btw, the problem isn't only the db, but all files used by the system (like logs), isn't it?
Yeah. Locking the database would also protect them (effectively), as we open the db early and crash if we can't. Except for a bit of logging before we try to open the db.
There is only one connection in our connection pool - which is fine. While the node runs, we only write, and when we write, the database locks anyway.
However, as processing batches might take a while, the index syncer cannot always get a connection within 5 seconds during index sync. Increasing the timeout to 60 seconds improves the situation.
Note that these failures do not cause data corruption, as the validators would simply be reattempted later, so it is fine if the timeout is still insufficient in edge cases.
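An illustrative sketch of that behavior, assuming an r2d2 pool as above; the SQL, sleep interval, and function shape are placeholders, not the project's actual index syncer:

use std::{thread, time::Duration};
use r2d2_sqlite::SqliteConnectionManager;

/// Illustrative only: a pool timeout is treated as transient. Nothing is
/// written in that case, so nothing can be corrupted; the next cycle retries.
fn run_index_sync(pool: r2d2::Pool<SqliteConnectionManager>) {
    loop {
        match pool.get() {
            // Got the single connection: persist the fetched indices
            // (placeholder SQL - the real schema lives in anchor/database).
            Ok(conn) => {
                if let Err(e) = conn.execute_batch("UPDATE validators SET synced = 1;") {
                    eprintln!("index sync write failed, will retry: {e}");
                }
            }
            // Timed out after CONNECTION_TIMEOUT: skip this round and retry later.
            Err(e) => eprintln!("index sync deferred, no DB connection: {e}"),
        }
        thread::sleep(Duration::from_secs(30));
    }
}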