Skip to content

[14833] DiscoveryServer list access deadlock fix <master> #2746

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 35 commits into from
Jun 30, 2022

Conversation

MiguelBarro
Copy link
Contributor

@MiguelBarro MiguelBarro commented Jun 10, 2022

Review & thread sanitizer results on #2738
It must be merged after #2745

Description

This PR must be merged after #2730.

This PR fixes Thread Sanitizer issues 4135 and its equivalents.

This deadlock is caused by having the PDP lock taken when iterating over the DiscoveryServer list calling match_pdp_reader_nts_, which in turn calls matched_reader_add_, locking the Writer lock.
As seen in previous Sanitizer issues, these two locks should be taken in reverse order: Writer first then PDP. When a participant is enabled, it takes the mutexes in the proper order, leading to a deadlock.

The deadlock occurs as follows:

PDPServer::createPDPEndpoints (A) -> StatefulWriter::matched_reader_add (B)
PDPServer::announceParticipantState (B) -> PDPServer::announceParticipantState (A)

Since the second pair of locks cannot be reversed due to the previous rule. to prevent this issue the use of the PDP mutex is phased out in favor of a new shared_mutex inside the BuildinProtocols header to control access to DIscoveryServer list. This allows the replacement of some PDP locks by new shared_lock calls.

Collateral extra works

As a piggyback the FileWatch mechanism to load the Discovery Server List from a file was revamped.
This was unavoidable because some FileWatch related tests failed as a consequence of the synchronization changes.
To summarize the changes:

  • FileWatch detection granularity now matches the chrono::system_clock one in each OS.
  • FileWatch filtering of redundant callbacks has been improved using file size besides last write time.
  • An new test has been added to test a hotfix in the Discovery Server List update mechanism.

Contributor Checklist

  • Commit messages follow the project guidelines.
  • The code follows the style guidelines of this project.
  • Tests that thoroughly check the new feature have been added/Regression tests checking the bug and its fix have been added.
  • Any new/modified methods have been properly documented using Doxygen.
  • Fast DDS test suite has been run locally.
  • Changes are ABI compatible.
  • Changes are API compatible.
  • Documentation builds and tests pass locally.
  • New feature has been added to the versions.md file (if applicable).
  • New feature has been documented/Current behavior is correctly described in the documentation.

Reviewer Checklist

  • Check contributor checklist is correct.
  • Check CI results: changes do not issue any warning.
  • Check CI results: failing tests are unrelated with the changes.

@MiguelBarro MiguelBarro added the skip-ci Automatically pass CI label Jun 10, 2022
@jsan-rt jsan-rt force-pushed the test/issue/4135 branch 3 times, most recently from efcbe21 to cd2e0c8 Compare June 14, 2022 09:05
@jsan-rt
Copy link
Contributor

jsan-rt commented Jun 15, 2022

@richiprosima please test this

@jsan-rt jsan-rt removed the skip-ci Automatically pass CI label Jun 15, 2022
@jsan-rt
Copy link
Contributor

jsan-rt commented Jun 15, 2022

@richiprosima please test this again

@EduPonz EduPonz added this to the v2.7.0 milestone Jun 16, 2022
@jsan-rt
Copy link
Contributor

jsan-rt commented Jun 16, 2022

@richiprosima please test this yet again

@jsan-rt
Copy link
Contributor

jsan-rt commented Jun 21, 2022

@richiprosima please test this

@MiguelBarro MiguelBarro force-pushed the test/issue/4135 branch 2 times, most recently from 98db776 to f02e71e Compare June 23, 2022 13:05
jsan-rt and others added 14 commits June 23, 2022 16:13
…cols DiscoveryServers list

Signed-off-by: Javier Santiago <[email protected]>
Signed-off-by: Javier Santiago <[email protected]>
…oved incorrectly placed lock to the proper place.

Signed-off-by: Javier Santiago <[email protected]>
Signed-off-by: Javier Santiago <[email protected]>
…cipant creation. Prevented wrong PDP access

Signed-off-by: Javier Santiago <[email protected]>
Signed-off-by: Javier Santiago <[email protected]>
Signed-off-by: Javier Santiago <[email protected]>
Signed-off-by: Miguel Barro <[email protected]>
@MiguelBarro
Copy link
Contributor Author

@richiprosima Please test this for me 🤯

@MiguelBarro
Copy link
Contributor Author

@richiprosima please test this

@MiguelBarro
Copy link
Contributor Author

@richiprosima please test this

Signed-off-by: Miguel Barro <[email protected]>
@MiguelBarro
Copy link
Contributor Author

@richiprosima please test this

Miguel Barro added 7 commits June 28, 2022 11:13
…ation policy. Because under certain busy loads the callbacks may happen twice per file modification

Signed-off-by: Miguel Barro <[email protected]>
Signed-off-by: Miguel Barro <[email protected]>
@MiguelBarro
Copy link
Contributor Author

@richiprosima please test this

@MiguelBarro
Copy link
Contributor Author

MiguelBarro commented Jun 28, 2022

@richiprosima please test this
ClipWindowsGIF

Copy link
Member

@MiguelCompany MiguelCompany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor comments.

I'm also a bit concerned that Discovery.TwentyParticipantsSeveralEndpointsUnicast tests have failed on several platforms.

Copy link
Member

@MiguelCompany MiguelCompany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@MiguelBarro
Copy link
Contributor Author

@richiprosima please test Mac. Let's see what happens with Discovery.TwentyParticipantsSeveralEndpointsUnicast

@MiguelBarro
Copy link
Contributor Author

Discovery.TwentyParticipantsSeveralEndpointsUnicast passed in all its variants.

@MiguelCompany MiguelCompany merged commit 7a2f319 into master Jun 30, 2022
@MiguelCompany MiguelCompany deleted the test/issue/4135 branch June 30, 2022 07:32
MiguelCompany pushed a commit that referenced this pull request Dec 22, 2023
* Refs #14833: Added new shared mutex to control access to BuiltinProtocols DiscoveryServers list

Signed-off-by: Javier Santiago <[email protected]>

* Refs #14833: Cleanup

Signed-off-by: Javier Santiago <[email protected]>

* Reversed PDP and Writer lock order inside announceParticipantState. Moved incorrectly placed lock to the proper place.

Signed-off-by: Javier Santiago <[email protected]>

* Refs #14833: Linter pass

Signed-off-by: Javier Santiago <[email protected]>

* Refs #14833: Reversed lock order for PDP server/client participant state announcement

Signed-off-by: Javier Santiago <[email protected]>

* Refs #14833: Fixed Windows Warnings

Signed-off-by: Javier Santiago <[email protected]>

* Refs #14833: Applied suggestions. Improved Filewatch trigger on participant creation. Prevented wrong PDP access

Signed-off-by: Javier Santiago <[email protected]>

* Refs #14833: Restricted PDP lock scope

Signed-off-by: Javier Santiago <[email protected]>

* Refs #14833: Linter pass

Signed-off-by: Javier Santiago <[email protected]>

* Applied reviewer inline comments

Signed-off-by: Javier Santiago <[email protected]>

* Refs #14833: Fixed wrong addition to discovery server list. Added new test skeleton

Signed-off-by: Javier Santiago <[email protected]>

* Refs #14833: Fixing and Adding new tests. Fixing linux & mac warnings.

Signed-off-by: Miguel Barro <[email protected]>

* Refs #14833: Use specific filenames per test to allow concurrency

Signed-off-by: Miguel Barro <[email protected]>

* Refs #14833: Linter pass

Signed-off-by: Miguel Barro <[email protected]>

* Refs #14833: Fixing clang warnings

Signed-off-by: Miguel Barro <[email protected]>

* Refs #14833: Fixing clang warnings

Signed-off-by: Miguel Barro <[email protected]>

* Refs #14833: Fixing clang warnings

Signed-off-by: Miguel Barro <[email protected]>

* Refs #14833: Avoiding fix delay on FileWatch callbacks

Signed-off-by: Miguel Barro <[email protected]>

* Refs #14833: Linter

Signed-off-by: Miguel Barro <[email protected]>

* Refs #14833: Linter

Signed-off-by: Miguel Barro <[email protected]>

* Refs #14833: Fixing clang issues

Signed-off-by: Miguel Barro <[email protected]>

* Refs #14833: Low FileWatch granularity from seconds to system_clock one (Windows 100ns/Unix 1ns)

Signed-off-by: Miguel Barro <[email protected]>

* Refs #14833: Speed up ParticipantTests that involve file callbacks

Signed-off-by: Miguel Barro <[email protected]>

* Refs #14833: trim tests CMakeLists.txt from sources exporting unused objects.

Signed-off-by: Miguel Barro <[email protected]>

* Refs #14833. partial review

Signed-off-by: Miguel Barro <[email protected]>

* Refs #14833: Fixing SystemInfoTests to account for the new filetime granularity

Signed-off-by: Miguel Barro <[email protected]>

* Refs #14833: Avoid using chrono literals (c++14)

Signed-off-by: Miguel Barro <[email protected]>

* Refs #14833: Account for stat() delay to retrieve actual FileTime info.

Signed-off-by: Miguel Barro <[email protected]>

* Refs #14833: modify the test to prevent the strict 1 call per modification policy. Because under certain busy loads the callbacks may happen twice per file modification

Signed-off-by: Miguel Barro <[email protected]>

* Refs #14833: Linter pass

Signed-off-by: Miguel Barro <[email protected]>

* Refs #14833: Improve FileWatch callback filtering using file size

Signed-off-by: Miguel Barro <[email protected]>

* Refs #14833: New callback filter makes waits redundant

Signed-off-by: Miguel Barro <[email protected]>

* Refs #14833: Undo DiscoveryDatabase fix for a more complete on an independent pull request

Signed-off-by: Miguel Barro <[email protected]>

* Refs #14833: Fixing clang warnings

Signed-off-by: Miguel Barro <[email protected]>

* Refs #14833: Addressing reviewers' comments

Signed-off-by: Miguel Barro <[email protected]>

Co-authored-by: Javier Santiago <[email protected]>
Co-authored-by: Miguel Barro <[email protected]>
MiguelCompany pushed a commit that referenced this pull request Dec 26, 2023
* Refs #14833: Added new shared mutex to control access to BuiltinProtocols DiscoveryServers list

Signed-off-by: Javier Santiago <[email protected]>

* Refs #14833: Cleanup

Signed-off-by: Javier Santiago <[email protected]>

* Reversed PDP and Writer lock order inside announceParticipantState. Moved incorrectly placed lock to the proper place.

Signed-off-by: Javier Santiago <[email protected]>

* Refs #14833: Linter pass

Signed-off-by: Javier Santiago <[email protected]>

* Refs #14833: Reversed lock order for PDP server/client participant state announcement

Signed-off-by: Javier Santiago <[email protected]>

* Refs #14833: Fixed Windows Warnings

Signed-off-by: Javier Santiago <[email protected]>

* Refs #14833: Applied suggestions. Improved Filewatch trigger on participant creation. Prevented wrong PDP access

Signed-off-by: Javier Santiago <[email protected]>

* Refs #14833: Restricted PDP lock scope

Signed-off-by: Javier Santiago <[email protected]>

* Refs #14833: Linter pass

Signed-off-by: Javier Santiago <[email protected]>

* Applied reviewer inline comments

Signed-off-by: Javier Santiago <[email protected]>

* Refs #14833: Fixed wrong addition to discovery server list. Added new test skeleton

Signed-off-by: Javier Santiago <[email protected]>

* Refs #14833: Fixing and Adding new tests. Fixing linux & mac warnings.

Signed-off-by: Miguel Barro <[email protected]>

* Refs #14833: Use specific filenames per test to allow concurrency

Signed-off-by: Miguel Barro <[email protected]>

* Refs #14833: Linter pass

Signed-off-by: Miguel Barro <[email protected]>

* Refs #14833: Fixing clang warnings

Signed-off-by: Miguel Barro <[email protected]>

* Refs #14833: Fixing clang warnings

Signed-off-by: Miguel Barro <[email protected]>

* Refs #14833: Fixing clang warnings

Signed-off-by: Miguel Barro <[email protected]>

* Refs #14833: Avoiding fix delay on FileWatch callbacks

Signed-off-by: Miguel Barro <[email protected]>

* Refs #14833: Linter

Signed-off-by: Miguel Barro <[email protected]>

* Refs #14833: Linter

Signed-off-by: Miguel Barro <[email protected]>

* Refs #14833: Fixing clang issues

Signed-off-by: Miguel Barro <[email protected]>

* Refs #14833: Low FileWatch granularity from seconds to system_clock one (Windows 100ns/Unix 1ns)

Signed-off-by: Miguel Barro <[email protected]>

* Refs #14833: Speed up ParticipantTests that involve file callbacks

Signed-off-by: Miguel Barro <[email protected]>

* Refs #14833: trim tests CMakeLists.txt from sources exporting unused objects.

Signed-off-by: Miguel Barro <[email protected]>

* Refs #14833. partial review

Signed-off-by: Miguel Barro <[email protected]>

* Refs #14833: Fixing SystemInfoTests to account for the new filetime granularity

Signed-off-by: Miguel Barro <[email protected]>

* Refs #14833: Avoid using chrono literals (c++14)

Signed-off-by: Miguel Barro <[email protected]>

* Refs #14833: Account for stat() delay to retrieve actual FileTime info.

Signed-off-by: Miguel Barro <[email protected]>

* Refs #14833: modify the test to prevent the strict 1 call per modification policy. Because under certain busy loads the callbacks may happen twice per file modification

Signed-off-by: Miguel Barro <[email protected]>

* Refs #14833: Linter pass

Signed-off-by: Miguel Barro <[email protected]>

* Refs #14833: Improve FileWatch callback filtering using file size

Signed-off-by: Miguel Barro <[email protected]>

* Refs #14833: New callback filter makes waits redundant

Signed-off-by: Miguel Barro <[email protected]>

* Refs #14833: Undo DiscoveryDatabase fix for a more complete on an independent pull request

Signed-off-by: Miguel Barro <[email protected]>

* Refs #14833: Fixing clang warnings

Signed-off-by: Miguel Barro <[email protected]>

* Refs #14833: Addressing reviewers' comments

Signed-off-by: Miguel Barro <[email protected]>

Co-authored-by: Javier Santiago <[email protected]>
Co-authored-by: Miguel Barro <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants