-
Notifications
You must be signed in to change notification settings - Fork 817
[14833] DiscoveryServer list access deadlock fix <master> #2746
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
efcbe21
to
cd2e0c8
Compare
@richiprosima please test this |
@richiprosima please test this again |
@richiprosima please test this yet again |
@richiprosima please test this |
98db776
to
f02e71e
Compare
…cols DiscoveryServers list Signed-off-by: Javier Santiago <[email protected]>
Signed-off-by: Javier Santiago <[email protected]>
…oved incorrectly placed lock to the proper place. Signed-off-by: Javier Santiago <[email protected]>
Signed-off-by: Javier Santiago <[email protected]>
…ate announcement Signed-off-by: Javier Santiago <[email protected]>
Signed-off-by: Javier Santiago <[email protected]>
…cipant creation. Prevented wrong PDP access Signed-off-by: Javier Santiago <[email protected]>
Signed-off-by: Javier Santiago <[email protected]>
Signed-off-by: Javier Santiago <[email protected]>
Signed-off-by: Javier Santiago <[email protected]>
… test skeleton Signed-off-by: Javier Santiago <[email protected]>
Signed-off-by: Miguel Barro <[email protected]>
Signed-off-by: Miguel Barro <[email protected]>
Signed-off-by: Miguel Barro <[email protected]>
c20fe2a
to
8e5772c
Compare
@richiprosima Please test this for me 🤯 |
Signed-off-by: Miguel Barro <[email protected]>
Signed-off-by: Miguel Barro <[email protected]>
@richiprosima please test this |
…objects. Signed-off-by: Miguel Barro <[email protected]>
@richiprosima please test this |
Signed-off-by: Miguel Barro <[email protected]>
@richiprosima please test this |
…ranularity Signed-off-by: Miguel Barro <[email protected]>
cdc152b
to
f11fe5f
Compare
Signed-off-by: Miguel Barro <[email protected]>
Signed-off-by: Miguel Barro <[email protected]>
…ation policy. Because under certain busy loads the callbacks may happen twice per file modification Signed-off-by: Miguel Barro <[email protected]>
Signed-off-by: Miguel Barro <[email protected]>
Signed-off-by: Miguel Barro <[email protected]>
Signed-off-by: Miguel Barro <[email protected]>
…ependent pull request Signed-off-by: Miguel Barro <[email protected]>
@richiprosima please test this |
Signed-off-by: Miguel Barro <[email protected]>
@richiprosima please test this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some minor comments.
I'm also a bit concerned that Discovery.TwentyParticipantsSeveralEndpointsUnicast tests have failed on several platforms.
Signed-off-by: Miguel Barro <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@richiprosima please test Mac. Let's see what happens with |
|
* Refs #14833: Added new shared mutex to control access to BuiltinProtocols DiscoveryServers list Signed-off-by: Javier Santiago <[email protected]> * Refs #14833: Cleanup Signed-off-by: Javier Santiago <[email protected]> * Reversed PDP and Writer lock order inside announceParticipantState. Moved incorrectly placed lock to the proper place. Signed-off-by: Javier Santiago <[email protected]> * Refs #14833: Linter pass Signed-off-by: Javier Santiago <[email protected]> * Refs #14833: Reversed lock order for PDP server/client participant state announcement Signed-off-by: Javier Santiago <[email protected]> * Refs #14833: Fixed Windows Warnings Signed-off-by: Javier Santiago <[email protected]> * Refs #14833: Applied suggestions. Improved Filewatch trigger on participant creation. Prevented wrong PDP access Signed-off-by: Javier Santiago <[email protected]> * Refs #14833: Restricted PDP lock scope Signed-off-by: Javier Santiago <[email protected]> * Refs #14833: Linter pass Signed-off-by: Javier Santiago <[email protected]> * Applied reviewer inline comments Signed-off-by: Javier Santiago <[email protected]> * Refs #14833: Fixed wrong addition to discovery server list. Added new test skeleton Signed-off-by: Javier Santiago <[email protected]> * Refs #14833: Fixing and Adding new tests. Fixing linux & mac warnings. Signed-off-by: Miguel Barro <[email protected]> * Refs #14833: Use specific filenames per test to allow concurrency Signed-off-by: Miguel Barro <[email protected]> * Refs #14833: Linter pass Signed-off-by: Miguel Barro <[email protected]> * Refs #14833: Fixing clang warnings Signed-off-by: Miguel Barro <[email protected]> * Refs #14833: Fixing clang warnings Signed-off-by: Miguel Barro <[email protected]> * Refs #14833: Fixing clang warnings Signed-off-by: Miguel Barro <[email protected]> * Refs #14833: Avoiding fix delay on FileWatch callbacks Signed-off-by: Miguel Barro <[email protected]> * Refs #14833: Linter Signed-off-by: Miguel Barro <[email protected]> * Refs #14833: Linter Signed-off-by: Miguel Barro <[email protected]> * Refs #14833: Fixing clang issues Signed-off-by: Miguel Barro <[email protected]> * Refs #14833: Low FileWatch granularity from seconds to system_clock one (Windows 100ns/Unix 1ns) Signed-off-by: Miguel Barro <[email protected]> * Refs #14833: Speed up ParticipantTests that involve file callbacks Signed-off-by: Miguel Barro <[email protected]> * Refs #14833: trim tests CMakeLists.txt from sources exporting unused objects. Signed-off-by: Miguel Barro <[email protected]> * Refs #14833. partial review Signed-off-by: Miguel Barro <[email protected]> * Refs #14833: Fixing SystemInfoTests to account for the new filetime granularity Signed-off-by: Miguel Barro <[email protected]> * Refs #14833: Avoid using chrono literals (c++14) Signed-off-by: Miguel Barro <[email protected]> * Refs #14833: Account for stat() delay to retrieve actual FileTime info. Signed-off-by: Miguel Barro <[email protected]> * Refs #14833: modify the test to prevent the strict 1 call per modification policy. Because under certain busy loads the callbacks may happen twice per file modification Signed-off-by: Miguel Barro <[email protected]> * Refs #14833: Linter pass Signed-off-by: Miguel Barro <[email protected]> * Refs #14833: Improve FileWatch callback filtering using file size Signed-off-by: Miguel Barro <[email protected]> * Refs #14833: New callback filter makes waits redundant Signed-off-by: Miguel Barro <[email protected]> * Refs #14833: Undo DiscoveryDatabase fix for a more complete on an independent pull request Signed-off-by: Miguel Barro <[email protected]> * Refs #14833: Fixing clang warnings Signed-off-by: Miguel Barro <[email protected]> * Refs #14833: Addressing reviewers' comments Signed-off-by: Miguel Barro <[email protected]> Co-authored-by: Javier Santiago <[email protected]> Co-authored-by: Miguel Barro <[email protected]>
* Refs #14833: Added new shared mutex to control access to BuiltinProtocols DiscoveryServers list Signed-off-by: Javier Santiago <[email protected]> * Refs #14833: Cleanup Signed-off-by: Javier Santiago <[email protected]> * Reversed PDP and Writer lock order inside announceParticipantState. Moved incorrectly placed lock to the proper place. Signed-off-by: Javier Santiago <[email protected]> * Refs #14833: Linter pass Signed-off-by: Javier Santiago <[email protected]> * Refs #14833: Reversed lock order for PDP server/client participant state announcement Signed-off-by: Javier Santiago <[email protected]> * Refs #14833: Fixed Windows Warnings Signed-off-by: Javier Santiago <[email protected]> * Refs #14833: Applied suggestions. Improved Filewatch trigger on participant creation. Prevented wrong PDP access Signed-off-by: Javier Santiago <[email protected]> * Refs #14833: Restricted PDP lock scope Signed-off-by: Javier Santiago <[email protected]> * Refs #14833: Linter pass Signed-off-by: Javier Santiago <[email protected]> * Applied reviewer inline comments Signed-off-by: Javier Santiago <[email protected]> * Refs #14833: Fixed wrong addition to discovery server list. Added new test skeleton Signed-off-by: Javier Santiago <[email protected]> * Refs #14833: Fixing and Adding new tests. Fixing linux & mac warnings. Signed-off-by: Miguel Barro <[email protected]> * Refs #14833: Use specific filenames per test to allow concurrency Signed-off-by: Miguel Barro <[email protected]> * Refs #14833: Linter pass Signed-off-by: Miguel Barro <[email protected]> * Refs #14833: Fixing clang warnings Signed-off-by: Miguel Barro <[email protected]> * Refs #14833: Fixing clang warnings Signed-off-by: Miguel Barro <[email protected]> * Refs #14833: Fixing clang warnings Signed-off-by: Miguel Barro <[email protected]> * Refs #14833: Avoiding fix delay on FileWatch callbacks Signed-off-by: Miguel Barro <[email protected]> * Refs #14833: Linter Signed-off-by: Miguel Barro <[email protected]> * Refs #14833: Linter Signed-off-by: Miguel Barro <[email protected]> * Refs #14833: Fixing clang issues Signed-off-by: Miguel Barro <[email protected]> * Refs #14833: Low FileWatch granularity from seconds to system_clock one (Windows 100ns/Unix 1ns) Signed-off-by: Miguel Barro <[email protected]> * Refs #14833: Speed up ParticipantTests that involve file callbacks Signed-off-by: Miguel Barro <[email protected]> * Refs #14833: trim tests CMakeLists.txt from sources exporting unused objects. Signed-off-by: Miguel Barro <[email protected]> * Refs #14833. partial review Signed-off-by: Miguel Barro <[email protected]> * Refs #14833: Fixing SystemInfoTests to account for the new filetime granularity Signed-off-by: Miguel Barro <[email protected]> * Refs #14833: Avoid using chrono literals (c++14) Signed-off-by: Miguel Barro <[email protected]> * Refs #14833: Account for stat() delay to retrieve actual FileTime info. Signed-off-by: Miguel Barro <[email protected]> * Refs #14833: modify the test to prevent the strict 1 call per modification policy. Because under certain busy loads the callbacks may happen twice per file modification Signed-off-by: Miguel Barro <[email protected]> * Refs #14833: Linter pass Signed-off-by: Miguel Barro <[email protected]> * Refs #14833: Improve FileWatch callback filtering using file size Signed-off-by: Miguel Barro <[email protected]> * Refs #14833: New callback filter makes waits redundant Signed-off-by: Miguel Barro <[email protected]> * Refs #14833: Undo DiscoveryDatabase fix for a more complete on an independent pull request Signed-off-by: Miguel Barro <[email protected]> * Refs #14833: Fixing clang warnings Signed-off-by: Miguel Barro <[email protected]> * Refs #14833: Addressing reviewers' comments Signed-off-by: Miguel Barro <[email protected]> Co-authored-by: Javier Santiago <[email protected]> Co-authored-by: Miguel Barro <[email protected]>
Review & thread sanitizer results on #2738
It must be merged after #2745
Description
This PR must be merged after #2730.
This PR fixes Thread Sanitizer issues 4135 and its equivalents.
This deadlock is caused by having the PDP lock taken when iterating over the DiscoveryServer list calling
match_pdp_reader_nts_
, which in turn callsmatched_reader_add_
, locking the Writer lock.As seen in previous Sanitizer issues, these two locks should be taken in reverse order: Writer first then PDP. When a participant is enabled, it takes the mutexes in the proper order, leading to a deadlock.
The deadlock occurs as follows:
PDPServer::createPDPEndpoints
(A) ->StatefulWriter::matched_reader_add
(B)PDPServer::announceParticipantState
(B) ->PDPServer::announceParticipantState
(A)Since the second pair of locks cannot be reversed due to the previous rule. to prevent this issue the use of the PDP mutex is phased out in favor of a new shared_mutex inside the BuildinProtocols header to control access to DIscoveryServer list. This allows the replacement of some PDP locks by new shared_lock calls.
Collateral extra works
As a piggyback the
FileWatch
mechanism to load the Discovery Server List from a file was revamped.This was unavoidable because some
FileWatch
related tests failed as a consequence of the synchronization changes.To summarize the changes:
FileWatch
detection granularity now matches thechrono::system_clock
one in each OS.FileWatch
filtering of redundant callbacks has been improved using file size besides last write time.Contributor Checklist
versions.md
file (if applicable).Reviewer Checklist