Skip to content

[fix][meta] Fix ephemeral handling of ZK nodes and fix MockZooKeeper ephemeral and ZK stat handling #23988

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

lhotari
Copy link
Member

@lhotari lhotari commented Feb 14, 2025

Motivation

This PR is a follow-up to #23984 to fix ephemeral status handling in ZKMetadataStore and to improve the ZooKeeper testing infrastructure.

A key addition is enabling pulsar-metadata tests to run against MockZooKeeper through a proper MetadataStoreProvider implementation, ensuring MockZooKeeper correctly implements all features that Pulsar depends on.

Modifications

  1. Fixed Production Code Issues:
  • Fixed ephemeral status handling in ZKMetadataStore to properly detect ephemeral nodes by using the correct comparison value (0L) for non-ephemeral (== persistent) nodes
  1. MockZooKeeper improvements and fixes:
  • Fixed ephemeral node handling and cleanup on session termination
  • Improved ZK stat handling with proper timestamps and version numbers
  • Added proper session ID management and consistent thread safety
  • Fixed persistent watcher cleanup on session termination
  • Fixed getChildren implementation and removed code duplication
  • Fixed ordering issues by migrating to use a single threaded executor model
    • This allowed to remove locks and synchronization in most cases
  1. Test Framework Improvements::
  • Implemented MockZooKeeperMetadataStoreProvider to allow pulsar-metadata tests to use MockZooKeeper
    • This ensures that MockZooKeeper correctly implements all MetadataStore features that Pulsar depends on
  • Enhanced PulsarTestContext to support toggling between MockZooKeeper and real ZK (TestZKServer) implementations
  • Added ability to run MockedPulsarServiceBaseTest tests with both MockZooKeeper and real ZK (TestZKServer)

There are some unrelated NPE fixes and improvements for better exception messages. I faced those issues while debugging BrokerServiceTest and noticed those exceptions in the execution logs. Some of the exceptions were due to invalid tests, however the improvements will be useful also later.

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

@lhotari lhotari requested a review from heesung-sn February 14, 2025 15:40
@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Feb 14, 2025
@lhotari lhotari added this to the 4.1.0 milestone Feb 14, 2025
@lhotari lhotari added release/3.0.10 release/3.3.5 release/4.0.3 release/blocker Indicate the PR or issue that should block the release until it gets resolved labels Feb 14, 2025
@lhotari lhotari force-pushed the lh-fix-ephemeral-support-in-MockZooKeeper branch from 55f62ad to b4e04d5 Compare February 17, 2025 12:11
- ensures that it has a separate session
@lhotari lhotari force-pushed the lh-fix-ephemeral-support-in-MockZooKeeper branch from b4e04d5 to 2c7b042 Compare February 17, 2025 12:22
@lhotari lhotari merged commit df51972 into apache:master Feb 17, 2025
51 of 52 checks passed
lhotari added a commit that referenced this pull request Feb 17, 2025
…ephemeral and ZK stat handling (#23988)

(cherry picked from commit df51972)
@lhotari
Copy link
Member Author

lhotari commented Feb 17, 2025

cherry-picking depends on #23686

lhotari added a commit that referenced this pull request Feb 17, 2025
…ephemeral and ZK stat handling (#23988)

(cherry picked from commit df51972)
lhotari added a commit that referenced this pull request Feb 17, 2025
…ephemeral and ZK stat handling (#23988)

(cherry picked from commit df51972)
nikhil-ctds pushed a commit to datastax/pulsar that referenced this pull request Feb 19, 2025
…ephemeral and ZK stat handling (apache#23988)

(cherry picked from commit df51972)
(cherry picked from commit 67ae209)
mukesh-ctds pushed a commit to datastax/pulsar that referenced this pull request Feb 20, 2025
…ephemeral and ZK stat handling (apache#23988)

(cherry picked from commit df51972)
(cherry picked from commit cd1e9db)
srinath-ctds pushed a commit to datastax/pulsar that referenced this pull request Feb 24, 2025
…ephemeral and ZK stat handling (apache#23988)

(cherry picked from commit df51972)
(cherry picked from commit 67ae209)
nodece pushed a commit to nodece/pulsar that referenced this pull request Mar 27, 2025
…ephemeral and ZK stat handling (apache#23988)

(cherry picked from commit df51972)
@lhotari
Copy link
Member Author

lhotari commented Apr 9, 2025

Some Pulsar tests are using MockZooKeeper's delay method. After #23988, MockZooKeeper is correctly single threaded and a delay will block the execution of this code in ZKSessionWatcher:

try {
zkClientState = future.get(tickTimeMillis, TimeUnit.MILLISECONDS);
} catch (TimeoutException e) {
// Consider zk disconnection if zk operation takes more than TICK_TIME
zkClientState = Watcher.Event.KeeperState.Disconnected;
}
checkState(zkClientState);

This will trigger a ConnectionLost session event. That's why a delay should be kept under 2000 ms.
Currently MockZooKeeper's session timeout is hard coded to 30000ms and the check interval is 1/15 of this, therefore 2000ms. It would be a good improvement to use the configured metadataStoreSessionTimeoutMillis instead.

@lhotari
Copy link
Member Author

lhotari commented Apr 10, 2025

Some Pulsar tests are using MockZooKeeper's delay method. After #23988, MockZooKeeper is correctly single threaded and a delay will block the execution of this code in ZKSessionWatcher:

try {
zkClientState = future.get(tickTimeMillis, TimeUnit.MILLISECONDS);
} catch (TimeoutException e) {
// Consider zk disconnection if zk operation takes more than TICK_TIME
zkClientState = Watcher.Event.KeeperState.Disconnected;
}
checkState(zkClientState);

This will trigger a ConnectionLost session event. That's why a delay should be kept under 2000 ms. Currently MockZooKeeper's session timeout is hard coded to 30000ms and the check interval is 1/15 of this, therefore 2000ms. It would be a good improvement to use the configured metadataStoreSessionTimeoutMillis instead.

This is handled in #24171

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants