Skip to content

background overlay: move doing transaction validation in the background before forwarding to tx queue, fast failing invalid transactions #4316

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
anupsdf opened this issue May 14, 2024 · 6 comments · May be fixed by #4617
Assignees

Comments

@anupsdf
Copy link
Contributor

anupsdf commented May 14, 2024

No description provided.

@marta-lokhova marta-lokhova changed the title background overlay: move doing transaction validation in the background before forwarding to tx queue, which allows fast failing invalid transactions background overlay: move doing transaction validation in the background before forwarding to tx queue, fast failing invalid transactions Jun 5, 2024
@marta-lokhova
Copy link
Contributor

Note that with the switch to BucketList snapshots, we're losing LedgerTxn's internal entry cache, which might be problematic for performance.

@MonsieurNicolas
Copy link
Contributor

Note that with the switch to BucketList snapshots, we're losing LedgerTxn's internal entry cache, which might be problematic for performance.

it sounds like we need a proper performance evaluation to ensure that we don't introduce performance regressions both when closing ledgers (catchup) and when flooding.

@bboston7
Copy link
Contributor

bboston7 commented Oct 9, 2024

Here are 3 options I see, in order of increasing amounts of backgrounding:

  1. Background the signature checks only. With this option, we would check transaction signatures in the background upon receiving them over the network. If signature verification succeeds, we continue with the current Herder::recvTransaction flow in the main thread. Otherwise, we drop the transaction. Then, when core validates the transaction it will achieve a cache hit in the signature cache, thus reducing the total work the main thread has to do. The benefits of this option is that it is small and simplest to get the threading right, while also tackling one of the most expensive parts of transaction validation.
  2. Background TransactionFrame::checkValid. With this option, we would run checkValid on transactions upon receiving them over the network. If they pass, they'll be passed on to the transaction queue in the main thread for the remaining checks and final adding to the queue. The benefit of this option is that it moves even more to the background, with the drawback of requiring even more work to achieve thread safety (for example, checkValid must be refactored to not take an Application).
  3. Background TransactionQueue::canAdd. This option backgrounds all of the checks run when adding a transaction to the queue. The benefit of this option is that it includes some structural checks that (1) and (2) don't, and therefore it preserves the same order of checks that stellar-core currently performs. The drawback is that it requires even more care to achieve safe multithreading, and I suspect that these structural checks do not make up a significant portion of time taken to add a transaction to the queue. This may be beyond the point of diminishing returns for backgrounding transaction validation.

Note that these options only apply to transactions received over the network. There will be no backgrounding of validation for transactions received over the HTTP endpoint. That will preserve the current order of checks (and therefore error codes) for locally submitted transactions.

I plan to use Tracy to figure out what the most expensive portions of transaction validation are to help answer which of these options we should take, but I also wanted to put these out there to gather feedback from anyone who has an opinion on these three choices (or any others that I may have missed).

@marta-lokhova
Copy link
Contributor

marta-lokhova commented Oct 15, 2024

Thanks for putting together the options! I think we can carve out a change that's not too complicated in terms of implementation, but yields a good perf gain. Thinking about this more, a few questions:

  • I think option 3 requires a significant amount of work, since this means we'd need to add synchronization to the whole transactions queue class. What do we get in terms of perf gains? For the error checking, we should be able to re-arrange things to avoid unnecessary synchronization.
  • checkValid should hopefully be fairly simple to make thread-safe given that we use snapshots now. Also, Cleanup apply and validation flows #4471 refactors checkValid to take AppConnector, which forces us to either run it from the main thread, or implement thread-safe methods. That being said, what benefits do we gain from going the checkValid route vs just signature verification?
  • Another idea (a bit more involved, but could yield really good benefits): if we're able to do checkValid in the background, what benefits would be gain from flooding valid transactions right away from the background? This would remove the main thread head-of-line blocking issue that txs face right now. Note that we'd need to do something about the flooding schedule: that is, currently we don't eagerly flood right away, but flood in batches on a schedule (via FLOOD_OP_RATE_PER_LEDGER and FLOOD_TX_PERIOD_MS configs ). The periodic flooding is done from the main thread right now. Implementation-wise, this is closer to option 3, since we'd need to synchronize access to the tx queue, but I think this may yield a pretty dramatic improvement to tx flooding.
  • I agree that some experimentation and understanding of end-to-end latency of transaction processing would be really useful to understand the path forward here.

@marta-lokhova
Copy link
Contributor

Thinking more about synchronization at the transaction queue layer: a simpler approach might be removing overlay dependency on the transaction queue completely. This way, overlay would maintain its own pool of transactions to flood, removing the need to share state with main. I believe that would be a step in the right direction, as it would further isolate the "relay" component of core.

bboston7 added a commit to bboston7/stellar-core that referenced this issue Jan 9, 2025
This is a *draft* change that will resolve stellar#4316 when it is complete.
The change makes `TransactionQueue` thread safe and runs the `tryAdd`
function in the background when the feature is enabled. The
implementation closely follows the
[design document](https://docs.google.com/document/d/1pU__XfEp-rR-17TNsuj-VhY6JfyendaFSYLTiq6tIj4/edit?usp=sharing)
I wrote.  The implementation still requires the main thread to
re-broadcast the transactions (for now). I've opened this PR for
visibility / early feedback on the implementation.

This change is very much a work in progress, with the following tasks
remaining:

* [ ] Fix catchup. I seem to have broken catchup in rebasing these
      changes on master. I need to figure out what is going on there and fix
      it.
* [ ] Fix failing tests. These are failing because they don't update
      `TransactionQueue`s new snapshots correctly.
* [ ] Rigorous testing, both for correctness and performance.
* [ ] I'd like to take a look at pushing the cut-point out a bit to
      enable flooding in the background as well. If this is a relatively
      simple change, I'd like to roll it into this PR. If it looks hairy,
      then I'll leave it for a separate change later.
bboston7 added a commit to bboston7/stellar-core that referenced this issue Jan 9, 2025
This is a *draft* change that will resolve stellar#4316 when it is complete.
The change makes `TransactionQueue` thread safe and runs the `tryAdd`
function in the background when the feature is enabled. The
implementation closely follows the
[design document](https://docs.google.com/document/d/1pU__XfEp-rR-17TNsuj-VhY6JfyendaFSYLTiq6tIj4/edit?usp=sharing)
I wrote.  The implementation still requires the main thread to
re-broadcast the transactions (for now). I've opened this PR for
visibility / early feedback on the implementation.

This change is very much a work in progress, with the following tasks
remaining:

* [ ] Fix catchup. I seem to have broken catchup in rebasing these
      changes on master. I need to figure out what is going on there and fix
      it.
* [ ] Fix failing tests. These are failing because they don't update
      `TransactionQueue`s new snapshots correctly.
* [ ] Rigorous testing, both for correctness and performance.
* [ ] I'd like to take a look at pushing the cut-point out a bit to
      enable flooding in the background as well. If this is a relatively
      simple change, I'd like to roll it into this PR. If it looks hairy,
      then I'll leave it for a separate change later.
bboston7 added a commit to bboston7/stellar-core that referenced this issue Jan 16, 2025
This is a *draft* change that will resolve stellar#4316 when it is complete.
The change makes `TransactionQueue` thread safe and runs the `tryAdd`
function in the background when the feature is enabled. The
implementation closely follows the
[design document](https://docs.google.com/document/d/1pU__XfEp-rR-17TNsuj-VhY6JfyendaFSYLTiq6tIj4/edit?usp=sharing)
I wrote.  The implementation still requires the main thread to
re-broadcast the transactions (for now). I've opened this PR for
visibility / early feedback on the implementation.

This change is very much a work in progress, with the following tasks
remaining:

* [ ] Fix catchup. I seem to have broken catchup in rebasing these
      changes on master. I need to figure out what is going on there and fix
      it.
* [ ] Fix failing tests. These are failing because they don't update
      `TransactionQueue`s new snapshots correctly.
* [ ] Rigorous testing, both for correctness and performance.
* [ ] I'd like to take a look at pushing the cut-point out a bit to
      enable flooding in the background as well. If this is a relatively
      simple change, I'd like to roll it into this PR. If it looks hairy,
      then I'll leave it for a separate change later.
bboston7 added a commit to bboston7/stellar-core that referenced this issue Feb 12, 2025
This is a *draft* change that will resolve stellar#4316 when it is complete.
The change makes `TransactionQueue` thread safe and runs the `tryAdd`
function in the background when the feature is enabled. The
implementation closely follows the
[design document](https://docs.google.com/document/d/1pU__XfEp-rR-17TNsuj-VhY6JfyendaFSYLTiq6tIj4/edit?usp=sharing)
I wrote.  The implementation still requires the main thread to
re-broadcast the transactions (for now). I've opened this PR for
visibility / early feedback on the implementation.

This change is very much a work in progress, with the following tasks
remaining:

* [ ] Fix catchup. I seem to have broken catchup in rebasing these
      changes on master. I need to figure out what is going on there and fix
      it.
* [ ] Fix failing tests. These are failing because they don't update
      `TransactionQueue`s new snapshots correctly.
* [ ] Rigorous testing, both for correctness and performance.
* [ ] I'd like to take a look at pushing the cut-point out a bit to
      enable flooding in the background as well. If this is a relatively
      simple change, I'd like to roll it into this PR. If it looks hairy,
      then I'll leave it for a separate change later.
bboston7 added a commit to bboston7/stellar-core that referenced this issue Mar 6, 2025
This is a *draft* change that will resolve stellar#4316 when it is complete.
The change makes `TransactionQueue` thread safe and runs the `tryAdd`
function in the background when the feature is enabled. The
implementation closely follows the
[design document](https://docs.google.com/document/d/1pU__XfEp-rR-17TNsuj-VhY6JfyendaFSYLTiq6tIj4/edit?usp=sharing)
I wrote.  The implementation still requires the main thread to
re-broadcast the transactions (for now). I've opened this PR for
visibility / early feedback on the implementation.

This change is very much a work in progress, with the following tasks
remaining:

* [ ] Fix catchup. I seem to have broken catchup in rebasing these
      changes on master. I need to figure out what is going on there and fix
      it.
* [ ] Fix failing tests. These are failing because they don't update
      `TransactionQueue`s new snapshots correctly.
* [ ] Rigorous testing, both for correctness and performance.
* [ ] I'd like to take a look at pushing the cut-point out a bit to
      enable flooding in the background as well. If this is a relatively
      simple change, I'd like to roll it into this PR. If it looks hairy,
      then I'll leave it for a separate change later.
bboston7 added a commit to bboston7/stellar-core that referenced this issue Mar 25, 2025
This is a *draft* change that will resolve stellar#4316 when it is complete.
The change makes `TransactionQueue` thread safe and runs the `tryAdd`
function in the background when the feature is enabled. The
implementation closely follows the
[design document](https://docs.google.com/document/d/1pU__XfEp-rR-17TNsuj-VhY6JfyendaFSYLTiq6tIj4/edit?usp=sharing)
I wrote.  The implementation still requires the main thread to
re-broadcast the transactions (for now). I've opened this PR for
visibility / early feedback on the implementation.

This change is very much a work in progress, with the following tasks
remaining:

* [ ] Fix catchup. I seem to have broken catchup in rebasing these
      changes on master. I need to figure out what is going on there and fix
      it.
* [ ] Fix failing tests. These are failing because they don't update
      `TransactionQueue`s new snapshots correctly.
* [ ] Rigorous testing, both for correctness and performance.
* [ ] I'd like to take a look at pushing the cut-point out a bit to
      enable flooding in the background as well. If this is a relatively
      simple change, I'd like to roll it into this PR. If it looks hairy,
      then I'll leave it for a separate change later.
bboston7 added a commit to bboston7/stellar-core that referenced this issue Apr 2, 2025
WIP: Background `tryAdd` functionality in TransactionQueue

This is a *draft* change that will resolve stellar#4316 when it is complete.
The change makes `TransactionQueue` thread safe and runs the `tryAdd`
function in the background when the feature is enabled. The
implementation closely follows the
[design document](https://docs.google.com/document/d/1pU__XfEp-rR-17TNsuj-VhY6JfyendaFSYLTiq6tIj4/edit?usp=sharing)
I wrote.  The implementation still requires the main thread to
re-broadcast the transactions (for now). I've opened this PR for
visibility / early feedback on the implementation.

This change is very much a work in progress, with the following tasks
remaining:

* [ ] Fix catchup. I seem to have broken catchup in rebasing these
      changes on master. I need to figure out what is going on there and fix
      it.
* [ ] Fix failing tests. These are failing because they don't update
      `TransactionQueue`s new snapshots correctly.
* [ ] Rigorous testing, both for correctness and performance.
* [ ] I'd like to take a look at pushing the cut-point out a bit to
      enable flooding in the background as well. If this is a relatively
      simple change, I'd like to roll it into this PR. If it looks hairy,
      then I'll leave it for a separate change later.

Restore protocol version check in `SorobanTransactionQueue::getMaxQueueSizeOps`

Fix missing soroban network config issue

This is what was breaking catchup

recursive_mutex -> mutex

Remove redundant passing around of protocol version

Fix error introduced in rebase

Fix tx queue tests

Fix integration with bg ledger close

Typo

Push cut point out through receiving messages

Push cut point out a tiny bit

todo

Identify where to push back next

Atomic ledger state

Refactor out tx metrics from queue adding

In preparation for moving the cut point

Consistency

Getting close to having this all pushed out

Just need to wire up the final bit

Remove old post to bg thread

Indirection around tx queue storage

Feed tx queues to peer

Wire up background flow

Debug output

Fix tracy build

More debug info around broadcasting

Maybe fix deadlock

Fix deadlock with TCPPeer

Reduce main thread posts on broadcast

Add thread

Though it's not being used anywhere

Actually run on new thread

Another memory leak fix

Fix `maybeExecuteInBackground` to check thread type

Save main thread type

Prioritize main thread in locking

Some cleanup around separate tx queue thread

Post-rebase fixes

Formatting

More tracy instrumentation

Builds with reverted tx queue changes
bboston7 added a commit to bboston7/stellar-core that referenced this issue Apr 2, 2025
WIP: Background `tryAdd` functionality in TransactionQueue

This is a *draft* change that will resolve stellar#4316 when it is complete.
The change makes `TransactionQueue` thread safe and runs the `tryAdd`
function in the background when the feature is enabled. The
implementation closely follows the
[design document](https://docs.google.com/document/d/1pU__XfEp-rR-17TNsuj-VhY6JfyendaFSYLTiq6tIj4/edit?usp=sharing)
I wrote.  The implementation still requires the main thread to
re-broadcast the transactions (for now). I've opened this PR for
visibility / early feedback on the implementation.

This change is very much a work in progress, with the following tasks
remaining:

* [ ] Fix catchup. I seem to have broken catchup in rebasing these
      changes on master. I need to figure out what is going on there and fix
      it.
* [ ] Fix failing tests. These are failing because they don't update
      `TransactionQueue`s new snapshots correctly.
* [ ] Rigorous testing, both for correctness and performance.
* [ ] I'd like to take a look at pushing the cut-point out a bit to
      enable flooding in the background as well. If this is a relatively
      simple change, I'd like to roll it into this PR. If it looks hairy,
      then I'll leave it for a separate change later.

Restore protocol version check in `SorobanTransactionQueue::getMaxQueueSizeOps`

Fix missing soroban network config issue

This is what was breaking catchup

recursive_mutex -> mutex

Remove redundant passing around of protocol version

Fix error introduced in rebase

Fix tx queue tests

Fix integration with bg ledger close

Typo

Push cut point out through receiving messages

Push cut point out a tiny bit

todo

Identify where to push back next

Atomic ledger state

Refactor out tx metrics from queue adding

In preparation for moving the cut point

Consistency

Getting close to having this all pushed out

Just need to wire up the final bit

Remove old post to bg thread

Indirection around tx queue storage

Feed tx queues to peer

Wire up background flow

Debug output

Fix tracy build

More debug info around broadcasting

Maybe fix deadlock

Fix deadlock with TCPPeer

Reduce main thread posts on broadcast

Add thread

Though it's not being used anywhere

Actually run on new thread

Another memory leak fix

Fix `maybeExecuteInBackground` to check thread type

Save main thread type

Prioritize main thread in locking

Some cleanup around separate tx queue thread

Post-rebase fixes

Formatting

More tracy instrumentation

Builds with reverted tx queue changes
bboston7 added a commit to bboston7/stellar-core that referenced this issue Apr 8, 2025
Closes stellar#4316 and stellar#4607.

This change adds a new option `EXPERIMENTAL_BACKGROUND_TX_VALIDATION`
that causes stellar-core to validate any transactions received
over-the-wire in the background. It also wraps all ledger state needed
to validate transactions into a new `ExtendedLedgerSnapshot` object.
@marta-lokhova
Copy link
Contributor

Done in #4699

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants
@bboston7 @MonsieurNicolas @marta-lokhova @SirTyson @anupsdf and others