-
Notifications
You must be signed in to change notification settings - Fork 141
Fix[BMQIO]: multi thread listen #765
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
b3340f2
to
fef9fe6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Build 2706 of commit fef9fe6 has completed with FAILURE
11d0d59
to
9b45ea3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Build 2711 of commit 9b45ea3 has completed with FAILURE
@@ -92,6 +92,11 @@ void NtcChannelFactory::processListenerResult( | |||
bslstl::SharedPtrUtil::dynamicCast(&alias, channel); | |||
if (alias) { | |||
int catalogHandle = d_channels.add(alias); | |||
|
|||
// Increment resource usage count for new channel | |||
BALL_LOG_ERROR << "d_resourceMonitor.acquire()"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This branch is a port of other team's patch to NtcChannelFactory.
I caught an IT failure in GH Actions with their original code so I am in the process of debugging it, these comments will go away obviously.
#7 0x000055f93cbebb88 in BloombergLP::bsls::Assert::invokeHandler(BloombergLP::bsls::AssertViolation const&) ()
No symbol table info available.
#8 0x000055f93c1597aa in BloombergLP::bmqu::AtomicValidator::release (this=0x55f9660f1ad0) at /home/runner/work/blazingmq/blazingmq/src/groups/bmq/bmqu/bmqu_atomicvalidator.h:245
violation = {d_comment_p = 0x55f93cfc69d1 "d_count >= 2", d_fileName_p = 0x55f93cfc6980 "/home/runner/work/blazingmq/blazingmq/src/groups/bmq/bmqu/bmqu_atomicvalidator.h", d_lineNumber = 245, d_assertLevel_p = 0x55f93d10f760 <BloombergLP::bsls::Assert::k_LEVEL_SAFE> "SAF"}
#9 0x000055f93c8a8669 in BloombergLP::bmqio::NtcChannelFactory::processChannelClosed (this=0x55f9660f1930, handle=8388609) at /home/runner/work/blazingmq/blazingmq/src/groups/bmq/bmqio/bmqio_ntcchannelfactory.cpp:164
channel = {d_ptr_p = 0x7f3d54003070, d_rep_p = 0x7f3d54003050}
rc = 0
valGuard.release()->release(); | ||
d_validator.invalidate(); // Disallow new listen/connect |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little confused by the logic here, I just want to outline my understanding:
- Acquire this resource (to get here it must be "valid")
- Release it from the guard that's managing it, and also decrement your reference count
- Invalidate the resource after decrementing our reference to it
This makes me a little concerned about the ordering in 2 and 3, we're modifying the object after we've relinquished our claim to it. Is this safe to do? e.g. say we release, then another thread gets scheduled and stop
gets called there, it executest release and then invalidate, then we context switch back to the original thread, calling invalidate again (which is UB according to its docs).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are 2 different roles when working with bmqu::AtomicValidator
:
- Resource users. They can only acquire/release resources, and it's safe to do from different competing threads.
- Resource admin. It decides when the resource is valid/invalid, and this decision is not thread-safe. We cannot try to invalidate a resource from different competing threads.
Regarding your concern, there is only one thread that tries to callbmqio::NtcChannelFactory::stop
, and we must not attempt to stop channel factory from different competing threads.
For this code:
- We acquire this resource to check that it is still valid, to prevent calling
stop
multiple times. The same can be achieved with a bool flag, but it's an extra field. - After this, we release this tmp resource.
First, we don't need it.
Second, we must release it, orinvalidate()
that we call next will wait forever. - We call
invalidate
|
||
lock.release()->unlock(); | ||
// Wait until all channels and listeners are finished | ||
d_resourceMonitor.invalidate(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again how do we know d_resourceMonitor
is valid?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
d_validator
and d_resourceMonitor
are synchronized together like 2 mutexes that are being locked in the same order.
If d_validator
was valid in the beginning, we don't do early return. And it also automatically means that d_resourceMonitor
is valid.
This is an invariant that we support by calling start
/stop
from one thread without thread competition.
acb926b
to
927750c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Build 2753 of commit 927750c has completed with FAILURE
Signed-off-by: Evgeny Malygin <[email protected]>
927750c
to
3af3a74
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Build 2875 of commit 3af3a74 has completed with FAILURE
No description provided.