Skip to content

[fix][txn] Fix deadlock when loading transaction buffer snapshot #24401

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

BewareMyPower
Copy link
Contributor

Fixes apache/pulsar-client-go#1376

Motivation

#23062 introduces a possible deadlock. transactionExecutorProvider is actually used by two different places:

  • TopicTransactionBuffer
  • PendingAckHandleImpl

return pendingAckHandle.pendingAckHandleFuture().thenCompose(future -> {

The future of PendingAckHandleImpl is completed in this executor, while the executor might be executing the blocking snapshot replay task. When there is only 1 thread in transactionExecutorProvider, the reader requires the dispatcher for messages, while the PendingAckHandleImpl object cannot complete its future because the transaction-executor thread is occupied.

There is another bug that when the reader fails with a non-retriable error, it will still exist in the cache at least for 60 seconds. See

private static final long CACHE_EXPIRE_TIMEOUT_MS = 60 * 1000L;

Modifications

Reduce numTransactionReplayThreadPoolSize and managedLedgerNumSchedulerThreads to 1. After that, TransactionTest will always fail at testCreateTransactionSystemTopic when creating a consumer

There are two major fixes:

  1. Add a separated executor service for the blocking snapshot replay task.
  2. Record the exception when the consumer's state becomes Fail and fail the getLastMessageId RPC immediately after that. In TableView#readLatest, remove the reader from cache if readToLatest fail.

With the 1st fix, testCreateTransactionSystemTopic will succeed but TransactionTest will still fail in testDeleteNamespace. It's because the previous testCreateTransactionSystemTopic recreated the tnx/ns1 namespace but the reader will keep reconnecting with Fail state to get the last message id. The 2nd fix will fix it.

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository:

@BewareMyPower BewareMyPower changed the title [fix][transaction] Fix deadlock when loading transaction buffer snapshot [fix][txn] Fix deadlock when loading transaction buffer snapshot Jun 9, 2025
@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Jun 9, 2025
@BewareMyPower BewareMyPower self-assigned this Jun 9, 2025
@codecov-commenter
Copy link

codecov-commenter commented Jun 11, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 74.24%. Comparing base (bbc6224) to head (2d8971f).
Report is 1140 commits behind head on master.

Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #24401      +/-   ##
============================================
+ Coverage     73.57%   74.24%   +0.66%     
- Complexity    32624    32675      +51     
============================================
  Files          1877     1867      -10     
  Lines        139502   145308    +5806     
  Branches      15299    16615    +1316     
============================================
+ Hits         102638   107879    +5241     
+ Misses        28908    28878      -30     
- Partials       7956     8551     +595     
Flag Coverage Δ
inttests 26.73% <50.00%> (+2.15%) ⬆️
systests 23.29% <26.47%> (-1.03%) ⬇️
unittests 73.72% <100.00%> (+0.87%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...n/java/org/apache/pulsar/broker/PulsarService.java 83.75% <100.00%> (+1.38%) ⬆️
...er/impl/SingleSnapshotAbortedTxnProcessorImpl.java 83.13% <100.00%> (-0.04%) ⬇️
...r/impl/SnapshotSegmentAbortedTxnProcessorImpl.java 81.76% <100.00%> (+3.25%) ⬆️
...lsar/broker/transaction/buffer/impl/TableView.java 84.84% <100.00%> (ø)
...main/java/org/apache/pulsar/utils/SimpleCache.java 100.00% <100.00%> (ø)
...va/org/apache/pulsar/client/impl/ConsumerImpl.java 79.81% <100.00%> (+2.24%) ⬆️

... and 1074 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@BewareMyPower BewareMyPower merged commit 1b7e4a7 into apache:master Jun 11, 2025
62 of 67 checks passed
@BewareMyPower BewareMyPower deleted the bewaremypower/fix-txn-snapshot-block branch June 12, 2025 06:20
BewareMyPower added a commit that referenced this pull request Jun 18, 2025
BewareMyPower added a commit that referenced this pull request Jun 18, 2025
ganesh-ctds pushed a commit to datastax/pulsar that referenced this pull request Jun 19, 2025
lhotari pushed a commit that referenced this pull request Jun 19, 2025
nodece pushed a commit to nodece/pulsar that referenced this pull request Jun 20, 2025
ganesh-ctds pushed a commit to datastax/pulsar that referenced this pull request Jun 21, 2025
srinath-ctds pushed a commit to datastax/pulsar that referenced this pull request Jun 24, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Flaky Test] TestMessageSingleRouter
4 participants