Skip to content

Add option to validate transactions in the background #4617

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: release/v22.3.0
Choose a base branch
from

Conversation

bboston7
Copy link
Contributor

@bboston7 bboston7 commented Jan 9, 2025

Closes #4316 and #4607.

This change adds a new option EXPERIMENTAL_BACKGROUND_TX_VALIDATION that causes stellar-core to validate any transactions received over-the-wire in the background. It also wraps all ledger state needed to validate transactions into a new ExtendedLedgerSnapshot object.

@bboston7 bboston7 requested a review from marta-lokhova January 9, 2025 19:27
@marta-lokhova
Copy link
Contributor

marta-lokhova commented Jan 10, 2025

thanks for putting this together! I'm excited about moving more of transaction processing to the background. It also seems like making tx queue thread-safe is not as dangerous/scary as we thought, which is great. My two main suggestions are to consolidate all read-only ledger state into a single object to avoid footguns, and re-work recvTransaction flow a little bit to remove main thread completely (this way we'll have a synchronous flow independent of main thread, from when the message is popped off the wire all the way to tx queue insertion)

Closes stellar#4316 and stellar#4607.

This change adds a new option `EXPERIMENTAL_BACKGROUND_TX_VALIDATION`
that causes stellar-core to validate any transactions received
over-the-wire in the background. It also wraps all ledger state needed
to validate transactions into a new `ExtendedLedgerSnapshot` object.
@bboston7 bboston7 changed the base branch from master to release/v22.3.0 April 8, 2025 18:09
@bboston7 bboston7 changed the title WIP: Background tryAdd functionality in TransactionQueue Add option to validate transactions in the background Apr 8, 2025
@bboston7 bboston7 marked this pull request as ready for review April 8, 2025 18:10
@bboston7
Copy link
Contributor Author

bboston7 commented Apr 8, 2025

I reworked this PR to do only validation in the background, as that's what our perf testing showed had the largest impact. It's now ready for review.

bool BACKGROUND_OVERLAY_PROCESSING;

// Enable parallel block application (experimental)
bool EXPERIMENTAL_PARALLEL_LEDGER_APPLY;

// Enable parallel transaction validation (experimental)
bool BACKGROUND_TX_VALIDATION;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: for consistency with the rest of the flags in the codebase, should this be named EXPERIMENTAL_BACKGROUND_TX_VALIDATION?

@@ -503,12 +503,12 @@ class CapacityTrackedMessage : private NonMovableOrCopyable
{
std::weak_ptr<Peer> const mWeakPeer;
StellarMessage const mMsg;
std::optional<Hash> mMaybeHash;
std::shared_ptr<Hash const> mMaybeHash;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious, why this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This helps to avoid copying Hash objects between various threads. This way we can just copy the pointer instead.

if (mConfig.BACKGROUND_TX_VALIDATION)
{
// Keep priority unchanged as transaction validation is time-sensitive
mTxValidationThread =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it a bit too early to decide if we need a separate thread for this? I think our overlay thread is currently far from being fully utilized, so we could just use it? (and add a new thread if needed in the future)

.header.scpValue.closeTime;
auto offset =
getUpperBoundCloseTimeOffset(mApp.getAppConnector(), closeTime);
mApp.postOnTxValidationThread(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that we're still bouncing between threads to process a single transaction? Every new transaction is received on the overlay thread. Then, it's scheduled to be processed on the main thread (but not much happens, so it sort of just dead weight for the scheduler?). Then, checkValid is performed on a special tx validation thread, and the result is posted back to main. I think this means that this flow will be as fast as the main thread (scheduler being the bottleneck). I think adding scheduler to the flow might prevent us from observing performance benefits, similar to what we discussed in this comment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could do validation in Peer::recvAuthenticatedMessage which happens in the background already. We can also probably just use the overlay thread itself for validation.


// A subclass of LedgerSnapshot that contains additional snapshots necessary for
// transaction validation.
class ExtendedLedgerSnapshot : public LedgerSnapshot
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to keep both LedgerSnapshot and ExtendedLedgerSnapshot? If it's just scope, it's fine to clean this up in the follow-up, but wanted to confirm.


private:
std::shared_ptr<const Config> const mConfig;
std::optional<const SorobanNetworkConfig> const mSorobanNetworkConfig;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel a bit uneasy SorobanNetworkConfig lives here: LedgerSnapshot was meant to be simply a ledger state wrapper, that supports both LedgerTxn and BucketListSnapshot underneath. I was hoping to not cache anything at this layer to avoid ledger state inconsistencies and other footguns. The only exception is mLedgerHeader, which has to be modified for validation purposes. SorobanNetworkConfig really belongs to immutable bucketlist snapshots, since the struct is just a different representation of what lives in the bucketlist already. Could we move it there?

// Ledger snapshot used for background transaction validation. Must be
// updated (replaced) after each new ledger via the `updateSnapshots`
// function.
std::unique_ptr<ExtendedLedgerSnapshot const> mLedgerSnapshot;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's safer to create a new LedgerSnapshot every time you need to use it to avoid caching inconsistencies.

@@ -164,6 +165,8 @@ class OverlayManagerImpl : public OverlayManager
void start() override;
void shutdown() override;

void updateSnapshots() override;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is needed: the underlying bucket snapshot in LedgerSnapshot is managed by LedgerManager already, so if you create a new LedgerSnapshot every time instead of caching it, you'll automatically get the most recent version of the ledger. No need for snapshot mutex either, since bucketlist are already refreshed in a thread-safe manner.

txResult =
tx->checkValid(mApp.getAppConnector(), ls, 0, 0,
getUpperBoundCloseTimeOffset(mApp, closeTime));
txResult = tx->checkValid(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there visible performance benefits when doing checkValid vs just calling verifySig on all the signers to populate the existing cache? Reason I'm asking is because skipping checkValid here means that we might be potentially adding invalid transactions to the queue (if ledger state changed while the transaction was being processed). This can be handled, but I wanted to make sure we'd actually get the benefits compared to a much simpler verifySig option. Note that in case of verifySig, we could simply call it in Peer::recvAuthenticatedMessage similar how it's done for SCP messages, and then still do checkValid on main thread. Because verifySig uses the global cache, we would just get a cache hit and avoid the extra cost of signature verification.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From tracy testing, signature verification makes up 2/3s to 3/4s of checkValid. So it's the largest component, but it's also not everything. I'd expect that if we were to scope this down to just signature verifcation, we'd get results that are slightly worse than 2/3s of 3/4s of what we're seeing here, just due to potential lock contention within the cache and the small possibility for bad luck with eviction (it's a random eviction cache, right?).

As far as invalid transactions in the queue go, I was under the impression that the validity of transactions in the queue was already "best effort", especially with Soroban where we can't validate much anyway? That's not to say we should make the problem "worse" without a good reason, but surely we already handle invalid transactions in the queue, right?

bool forApply)
: LedgerSnapshot(ltx)
, mConfig(app.getConfigPtr())
, mSorobanNetworkConfig(forApply
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think something here isn't right: ExtendedLedgerSnapshot is supposed to be "the canonical representation of ledger state". Other subsystems are meant to use the snapshot and query it for data, such as ledger entries, soroban config, ledger header etc. But here we still have a dependency on LM to maintain correctness of the config. In #4607 I was hoping to remove cached mSorobanNetworkConfig completely from LedgerManager, and have it rely on the snapshot instead. The nice thing about having all of the state inside the snapshot is that it's a lot less error-prone; the snapshot will always contain correct read-only state. If some parts of the state are managed by other classes like LedgerManager, that creates a risk of divergence if not updated correctly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants