-
Notifications
You must be signed in to change notification settings - Fork 4.9k
Hack to skip cleanup_dead_slots upon snapshot load #8561
Conversation
This PR is first worked on the Bad:
Good:
|
let (accounts, skip_account_assertion) = f(accounts, current_slot); | ||
|
||
assert_eq!(4, accounts.accounts_index.read().unwrap().roots.len()); | ||
if skip_account_assertion { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The following assertion causes failure currently for test_accounts_purge_chained_purge_after_snapshot_restore.
This should be fixed by #8337
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great to me as a workaround to get TdS unstuck. It would be nice to find a cleaner way, if possible, as a follow-up.
Thank you for the quick turnaround @ryoqun, you rock!
Bench failure is unrelated, |
This got needed in the v0.23 branch for #8436 |
@Mergifyio refresh |
Problem
A validator started from snapshot may create bad snapshots, which cause bank hash mismatch error when consumed by other validators. This is because the validator internally removes too much AccountsStorage and bad snapshots don't include some necessary ones anymore.
Detail of problem
generate_index
has been broken for long time about maintaining the correct count of storage's counts, which confusespurge_zero_lamport_accounts()
(demonstrating assertion are currently disabled in this pr, real fix #8337 is pending).This causes too many slots to be incorrectly marked as dead.
However, it hadn't caused problems because
cleanup_dead_slots()
had been broken too when called immediately after snapshot restoration. That function didn't removedead_slots
at all in that case.In short, the caller wrongly requested too many
slot
s to remove and the callee wrongly didn't anything. So, there had been a peace...But it changed since #8148. This PR is needed as one of fixes for the enlarged snapshot issue #8168. And that PR was thought rather harmless..
After #8148, the callee started to remove actually. But things don't break immediately. Snapshots load successfully. And validator just works.
Its rather subtle ramification of unbalanced fix is at the future snapshot creation: Because
purge_zero_lamport_accounts()
incorrectly removes too many slots as dead in some corner case, newly-created snapshots silently start to omit to include that removedstorage
s backing that slots, causing the deadlyBankHashMismatch
error because of obvious incorrect view into the accounts set.Summary of Changes
Until #8337 lands, just disable erroneous too eager removing behavior as an ugly hack...
This PR should have the same effect as reverting #8148, with better merge-conflict-free characteristics. #8337 should revert most of changes in this PR.
Fixes #8560
Backport only to v1.0 because v0.23 was branched before #8148.