Skip to content

[fix][broker] Fix broker shutdown delay by resolving hanging health checks #24210

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Apr 24, 2025

Conversation

lhotari
Copy link
Member

@lhotari lhotari commented Apr 24, 2025

Fixes #23413

Motivation

When a broker is shutting down while there are ongoing health checks, the process can be significantly delayed. This happens because health check operations continue attempting to reconnect and retry, as evidenced in the logs where the internal Pulsar client keeps performing lookups without proper termination (example logs in https://gist.github.com/lhotari/1caca1bac6fbcff75e9ff6339c1626b5).

Modifications

  • Extracted health check logic into a dedicated HealthChecker class for better separation of concerns
  • Implemented proper handling of pending health check operations during shutdown:
    • Added tracking of pending CompletableFuture operations
    • Added logic to complete these futures with exceptions when the broker is shutting down
  • Created a separate PulsarClient instance for health checks that can be properly closed during shutdown
    • This dedicated client allows health check operations to be properly terminated
    • Reused existing thread pools where appropriate (except for lookup and scheduled executors)
  • Added helper methods to PulsarService for creating customized PulsarClient instances
  • Refactored the heartbeat topic deletion process
  • Added comprehensive javadocs to the new HealthChecker class
  • Moved related tests to the broker group

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

@lhotari lhotari added this to the 4.1.0 milestone Apr 24, 2025
@lhotari lhotari requested a review from heesung-sn April 24, 2025 11:30
@lhotari lhotari self-assigned this Apr 24, 2025
@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Apr 24, 2025
@lhotari lhotari changed the title [fix][broker] Fix slow broker shutdown when there are on going health checks [fix][broker] Fix broker shutdown delay by resolving hanging health checks Apr 24, 2025
@lhotari lhotari requested a review from BewareMyPower April 24, 2025 11:43
@lhotari lhotari requested a review from Technoboy- April 24, 2025 11:44
@codecov-commenter
Copy link

codecov-commenter commented Apr 24, 2025

Codecov Report

Attention: Patch coverage is 77.95918% with 54 lines in your changes missing coverage. Please review.

Project coverage is 74.21%. Comparing base (bbc6224) to head (8bf6787).
Report is 1041 commits behind head on master.

Files with missing lines Patch % Lines
...rg/apache/pulsar/broker/service/HealthChecker.java 72.26% 35 Missing and 3 partials ⚠️
...n/java/org/apache/pulsar/broker/PulsarService.java 87.00% 8 Missing and 5 partials ⚠️
...rg/apache/pulsar/broker/service/BrokerService.java 33.33% 1 Missing and 1 partial ⚠️
...g/apache/pulsar/broker/admin/impl/BrokersBase.java 80.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #24210      +/-   ##
============================================
+ Coverage     73.57%   74.21%   +0.64%     
+ Complexity    32624    32155     -469     
============================================
  Files          1877     1866      -11     
  Lines        139502   144728    +5226     
  Branches      15299    16530    +1231     
============================================
+ Hits         102638   107410    +4772     
+ Misses        28908    28835      -73     
- Partials       7956     8483     +527     
Flag Coverage Δ
inttests 26.71% <47.75%> (+2.13%) ⬆️
systests 23.28% <12.24%> (-1.04%) ⬇️
unittests 73.70% <77.14%> (+0.86%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...g/apache/pulsar/broker/admin/impl/BrokersBase.java 88.51% <80.00%> (+16.05%) ⬆️
...rg/apache/pulsar/broker/service/BrokerService.java 83.74% <33.33%> (+2.96%) ⬆️
...n/java/org/apache/pulsar/broker/PulsarService.java 84.71% <87.00%> (+2.34%) ⬆️
...rg/apache/pulsar/broker/service/HealthChecker.java 72.26% <72.26%> (ø)

... and 1078 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@lhotari lhotari merged commit 12961ca into apache:master Apr 24, 2025
52 checks passed
lhotari added a commit that referenced this pull request Apr 24, 2025
lhotari added a commit that referenced this pull request Apr 24, 2025
lhotari added a commit that referenced this pull request Apr 24, 2025
manas-ctds pushed a commit to datastax/pulsar that referenced this pull request Apr 28, 2025
…hecks (apache#24210)

(cherry picked from commit 12961ca)
(cherry picked from commit 6455b02)
manas-ctds added a commit to datastax/pulsar that referenced this pull request Apr 28, 2025
manas-ctds pushed a commit to datastax/pulsar that referenced this pull request Apr 28, 2025
…hecks (apache#24210)

(cherry picked from commit 12961ca)
(cherry picked from commit 6455b02)
manas-ctds added a commit to datastax/pulsar that referenced this pull request Apr 28, 2025
manas-ctds pushed a commit to datastax/pulsar that referenced this pull request Apr 30, 2025
…hecks (apache#24210)

(cherry picked from commit 12961ca)
(cherry picked from commit 6455b02)
manas-ctds pushed a commit to datastax/pulsar that referenced this pull request Apr 30, 2025
…hecks (apache#24210)

(cherry picked from commit 12961ca)
(cherry picked from commit 6455b02)
ganesh-ctds pushed a commit to datastax/pulsar that referenced this pull request Apr 30, 2025
…hecks (apache#24210)

(cherry picked from commit 12961ca)
(cherry picked from commit 6455b02)
ganesh-ctds pushed a commit to datastax/pulsar that referenced this pull request May 1, 2025
…hecks (apache#24210)

(cherry picked from commit 12961ca)
(cherry picked from commit 6455b02)
ganesh-ctds pushed a commit to datastax/pulsar that referenced this pull request May 1, 2025
…hecks (apache#24210)

(cherry picked from commit 12961ca)
(cherry picked from commit 6455b02)
ganesh-ctds pushed a commit to datastax/pulsar that referenced this pull request May 1, 2025
…hecks (apache#24210)

(cherry picked from commit 12961ca)
(cherry picked from commit 6455b02)
manas-ctds pushed a commit to datastax/pulsar that referenced this pull request May 2, 2025
…hecks (apache#24210)

(cherry picked from commit 12961ca)
(cherry picked from commit 6455b02)
manas-ctds pushed a commit to datastax/pulsar that referenced this pull request May 2, 2025
…hecks (apache#24210)

(cherry picked from commit 12961ca)
(cherry picked from commit 606365a)
srinath-ctds pushed a commit to datastax/pulsar that referenced this pull request May 6, 2025
…hecks (apache#24210)

(cherry picked from commit 12961ca)
(cherry picked from commit 606365a)
nodece pushed a commit to ascentstream/pulsar that referenced this pull request May 28, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Flaky-test: BrokerRegistryMetadataStoreIntegrationTest.cleanup
3 participants