Skip to content

Fix[BMQIO]: multi thread listen #765

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

678098
Copy link
Collaborator

@678098 678098 commented May 29, 2025

No description provided.

@678098 678098 requested a review from a team as a code owner May 29, 2025 18:00
@678098 678098 force-pushed the 250505_fix_multithread_listen branch from b3340f2 to fef9fe6 Compare May 29, 2025 18:29
Copy link

@bmq-oss-ci bmq-oss-ci bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Build 2706 of commit fef9fe6 has completed with FAILURE

@678098 678098 force-pushed the 250505_fix_multithread_listen branch 2 times, most recently from 11d0d59 to 9b45ea3 Compare May 30, 2025 14:51
Copy link

@bmq-oss-ci bmq-oss-ci bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Build 2711 of commit 9b45ea3 has completed with FAILURE

@@ -92,6 +92,11 @@ void NtcChannelFactory::processListenerResult(
bslstl::SharedPtrUtil::dynamicCast(&alias, channel);
if (alias) {
int catalogHandle = d_channels.add(alias);

// Increment resource usage count for new channel
BALL_LOG_ERROR << "d_resourceMonitor.acquire()";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Error?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This branch is a port of other team's patch to NtcChannelFactory.
I caught an IT failure in GH Actions with their original code so I am in the process of debugging it, these comments will go away obviously.

#7  0x000055f93cbebb88 in BloombergLP::bsls::Assert::invokeHandler(BloombergLP::bsls::AssertViolation const&) ()
  No symbol table info available.
  #8  0x000055f93c1597aa in BloombergLP::bmqu::AtomicValidator::release (this=0x55f9660f1ad0) at /home/runner/work/blazingmq/blazingmq/src/groups/bmq/bmqu/bmqu_atomicvalidator.h:245
          violation = {d_comment_p = 0x55f93cfc69d1 "d_count >= 2", d_fileName_p = 0x55f93cfc6980 "/home/runner/work/blazingmq/blazingmq/src/groups/bmq/bmqu/bmqu_atomicvalidator.h", d_lineNumber = 245, d_assertLevel_p = 0x55f93d10f760 <BloombergLP::bsls::Assert::k_LEVEL_SAFE> "SAF"}
  #9  0x000055f93c8a8669 in BloombergLP::bmqio::NtcChannelFactory::processChannelClosed (this=0x55f9660f1930, handle=8388609) at /home/runner/work/blazingmq/blazingmq/src/groups/bmq/bmqio/bmqio_ntcchannelfactory.cpp:164
          channel = {d_ptr_p = 0x7f3d54003070, d_rep_p = 0x7f3d54003050}
          rc = 0

Comment on lines +265 to +264
valGuard.release()->release();
d_validator.invalidate(); // Disallow new listen/connect
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little confused by the logic here, I just want to outline my understanding:

  1. Acquire this resource (to get here it must be "valid")
  2. Release it from the guard that's managing it, and also decrement your reference count
  3. Invalidate the resource after decrementing our reference to it

This makes me a little concerned about the ordering in 2 and 3, we're modifying the object after we've relinquished our claim to it. Is this safe to do? e.g. say we release, then another thread gets scheduled and stop gets called there, it executest release and then invalidate, then we context switch back to the original thread, calling invalidate again (which is UB according to its docs).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are 2 different roles when working with bmqu::AtomicValidator:

  • Resource users. They can only acquire/release resources, and it's safe to do from different competing threads.
  • Resource admin. It decides when the resource is valid/invalid, and this decision is not thread-safe. We cannot try to invalidate a resource from different competing threads.
    Regarding your concern, there is only one thread that tries to call bmqio::NtcChannelFactory::stop, and we must not attempt to stop channel factory from different competing threads.

For this code:

  1. We acquire this resource to check that it is still valid, to prevent calling stop multiple times. The same can be achieved with a bool flag, but it's an extra field.
  2. After this, we release this tmp resource.
    First, we don't need it.
    Second, we must release it, or invalidate() that we call next will wait forever.
  3. We call invalidate


lock.release()->unlock();
// Wait until all channels and listeners are finished
d_resourceMonitor.invalidate();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again how do we know d_resourceMonitor is valid?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

d_validator and d_resourceMonitor are synchronized together like 2 mutexes that are being locked in the same order.
If d_validator was valid in the beginning, we don't do early return. And it also automatically means that d_resourceMonitor is valid.
This is an invariant that we support by calling start/stop from one thread without thread competition.

@678098 678098 force-pushed the 250505_fix_multithread_listen branch from acb926b to 927750c Compare June 13, 2025 14:52
Copy link

@bmq-oss-ci bmq-oss-ci bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Build 2753 of commit 927750c has completed with FAILURE

Signed-off-by: Evgeny Malygin <[email protected]>
@678098 678098 force-pushed the 250505_fix_multithread_listen branch from 927750c to 3af3a74 Compare July 9, 2025 18:01
Copy link

@bmq-oss-ci bmq-oss-ci bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Build 2875 of commit 3af3a74 has completed with FAILURE

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants