-
Notifications
You must be signed in to change notification settings - Fork 997
Add option to validate transactions in the background #4617
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: release/v22.3.0
Are you sure you want to change the base?
Conversation
thanks for putting this together! I'm excited about moving more of transaction processing to the background. It also seems like making tx queue thread-safe is not as dangerous/scary as we thought, which is great. My two main suggestions are to consolidate all read-only ledger state into a single object to avoid footguns, and re-work recvTransaction flow a little bit to remove main thread completely (this way we'll have a synchronous flow independent of main thread, from when the message is popped off the wire all the way to tx queue insertion) |
Closes stellar#4316 and stellar#4607. This change adds a new option `EXPERIMENTAL_BACKGROUND_TX_VALIDATION` that causes stellar-core to validate any transactions received over-the-wire in the background. It also wraps all ledger state needed to validate transactions into a new `ExtendedLedgerSnapshot` object.
tryAdd
functionality in TransactionQueue
I reworked this PR to do only validation in the background, as that's what our perf testing showed had the largest impact. It's now ready for review. |
bool BACKGROUND_OVERLAY_PROCESSING; | ||
|
||
// Enable parallel block application (experimental) | ||
bool EXPERIMENTAL_PARALLEL_LEDGER_APPLY; | ||
|
||
// Enable parallel transaction validation (experimental) | ||
bool BACKGROUND_TX_VALIDATION; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: for consistency with the rest of the flags in the codebase, should this be named EXPERIMENTAL_BACKGROUND_TX_VALIDATION?
@@ -503,12 +503,12 @@ class CapacityTrackedMessage : private NonMovableOrCopyable | |||
{ | |||
std::weak_ptr<Peer> const mWeakPeer; | |||
StellarMessage const mMsg; | |||
std::optional<Hash> mMaybeHash; | |||
std::shared_ptr<Hash const> mMaybeHash; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious, why this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This helps to avoid copying Hash
objects between various threads. This way we can just copy the pointer instead.
if (mConfig.BACKGROUND_TX_VALIDATION) | ||
{ | ||
// Keep priority unchanged as transaction validation is time-sensitive | ||
mTxValidationThread = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it a bit too early to decide if we need a separate thread for this? I think our overlay thread is currently far from being fully utilized, so we could just use it? (and add a new thread if needed in the future)
.header.scpValue.closeTime; | ||
auto offset = | ||
getUpperBoundCloseTimeOffset(mApp.getAppConnector(), closeTime); | ||
mApp.postOnTxValidationThread( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that we're still bouncing between threads to process a single transaction? Every new transaction is received on the overlay thread. Then, it's scheduled to be processed on the main thread (but not much happens, so it sort of just dead weight for the scheduler?). Then, checkValid is performed on a special tx validation thread, and the result is posted back to main. I think this means that this flow will be as fast as the main thread (scheduler being the bottleneck). I think adding scheduler to the flow might prevent us from observing performance benefits, similar to what we discussed in this comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could do validation in Peer::recvAuthenticatedMessage
which happens in the background already. We can also probably just use the overlay thread itself for validation.
|
||
// A subclass of LedgerSnapshot that contains additional snapshots necessary for | ||
// transaction validation. | ||
class ExtendedLedgerSnapshot : public LedgerSnapshot |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason to keep both LedgerSnapshot and ExtendedLedgerSnapshot? If it's just scope, it's fine to clean this up in the follow-up, but wanted to confirm.
|
||
private: | ||
std::shared_ptr<const Config> const mConfig; | ||
std::optional<const SorobanNetworkConfig> const mSorobanNetworkConfig; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel a bit uneasy SorobanNetworkConfig lives here: LedgerSnapshot was meant to be simply a ledger state wrapper, that supports both LedgerTxn and BucketListSnapshot underneath. I was hoping to not cache anything at this layer to avoid ledger state inconsistencies and other footguns. The only exception is mLedgerHeader, which has to be modified for validation purposes. SorobanNetworkConfig really belongs to immutable bucketlist snapshots, since the struct is just a different representation of what lives in the bucketlist already. Could we move it there?
// Ledger snapshot used for background transaction validation. Must be | ||
// updated (replaced) after each new ledger via the `updateSnapshots` | ||
// function. | ||
std::unique_ptr<ExtendedLedgerSnapshot const> mLedgerSnapshot; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's safer to create a new LedgerSnapshot every time you need to use it to avoid caching inconsistencies.
@@ -164,6 +165,8 @@ class OverlayManagerImpl : public OverlayManager | |||
void start() override; | |||
void shutdown() override; | |||
|
|||
void updateSnapshots() override; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is needed: the underlying bucket snapshot in LedgerSnapshot is managed by LedgerManager already, so if you create a new LedgerSnapshot every time instead of caching it, you'll automatically get the most recent version of the ledger. No need for snapshot mutex either, since bucketlist are already refreshed in a thread-safe manner.
txResult = | ||
tx->checkValid(mApp.getAppConnector(), ls, 0, 0, | ||
getUpperBoundCloseTimeOffset(mApp, closeTime)); | ||
txResult = tx->checkValid( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there visible performance benefits when doing checkValid vs just calling verifySig on all the signers to populate the existing cache? Reason I'm asking is because skipping checkValid here means that we might be potentially adding invalid transactions to the queue (if ledger state changed while the transaction was being processed). This can be handled, but I wanted to make sure we'd actually get the benefits compared to a much simpler verifySig option. Note that in case of verifySig, we could simply call it in Peer::recvAuthenticatedMessage
similar how it's done for SCP messages, and then still do checkValid on main thread. Because verifySig uses the global cache, we would just get a cache hit and avoid the extra cost of signature verification.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From tracy testing, signature verification makes up 2/3s to 3/4s of checkValid
. So it's the largest component, but it's also not everything. I'd expect that if we were to scope this down to just signature verifcation, we'd get results that are slightly worse than 2/3s of 3/4s of what we're seeing here, just due to potential lock contention within the cache and the small possibility for bad luck with eviction (it's a random eviction cache, right?).
As far as invalid transactions in the queue go, I was under the impression that the validity of transactions in the queue was already "best effort", especially with Soroban where we can't validate much anyway? That's not to say we should make the problem "worse" without a good reason, but surely we already handle invalid transactions in the queue, right?
bool forApply) | ||
: LedgerSnapshot(ltx) | ||
, mConfig(app.getConfigPtr()) | ||
, mSorobanNetworkConfig(forApply |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think something here isn't right: ExtendedLedgerSnapshot is supposed to be "the canonical representation of ledger state". Other subsystems are meant to use the snapshot and query it for data, such as ledger entries, soroban config, ledger header etc. But here we still have a dependency on LM to maintain correctness of the config. In #4607 I was hoping to remove cached mSorobanNetworkConfig
completely from LedgerManager, and have it rely on the snapshot instead. The nice thing about having all of the state inside the snapshot is that it's a lot less error-prone; the snapshot will always contain correct read-only state. If some parts of the state are managed by other classes like LedgerManager, that creates a risk of divergence if not updated correctly.
Closes #4316 and #4607.
This change adds a new option
EXPERIMENTAL_BACKGROUND_TX_VALIDATION
that causes stellar-core to validate any transactions received over-the-wire in the background. It also wraps all ledger state needed to validate transactions into a newExtendedLedgerSnapshot
object.