Skip to content

Fix crash at exit due to a race condition with new signal handler #2545

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Aug 28, 2024

Conversation

azeey
Copy link
Contributor

@azeey azeey commented Aug 26, 2024

🦟 Bug fix

Fixes #

Summary

Previously, the signal handler would interrupt the main thread (thread that calls Server::Run), so there was no chance for the main thread to resume before the signal handler was finished. With the new signal handler implementation in gz::common, the signal handler is now called from a different thread which means the main thread might resume before the signal handler is finished. This causes a race condition in which the Server class, and therefore the ServerPrivate class are being destructed while the signal handler is still active. Since ServerPrivate also contains the class that encapsulates the signal handler implementation, the causes a crash as the signal handler tries to access data that has been deleted.

The fix here is to ensure that the main thread does not exit until the signal handler is finished by synchronizing on a mutex.

Checklist

  • Signed all commits for DCO
  • Added tests
  • Updated documentation (as needed)
  • Updated migration guide (as needed)
  • Consider updating Python bindings (if the library has them)
  • codecheck passed (See contributing)
  • All tests passed (See test coverage)
  • While waiting for a review on your PR, please help review another open pull request to support the maintainers

Note to maintainers: Remember to use Squash-Merge and edit the commit message to match the pull request summary while retaining Signed-off-by messages.

Previously, the signal handler would interrupt the main thread (thread
that calls `Server::Run`), so there was no chance for the main thread to
resume before the signal handler was finished. With the new signal
handler implementation in gz::common, the signal handler is now called
from a different thread which means the main thread might resume before
the signal handler is finished. This causes a race condition in which
the `Server` class, and therefore the `ServerPrivate` class are being
destructed while the signal handler is still active. Since
`ServerPrivate` also contains the class that encapsulates the signal
handler implementation, the causes a crash as the signal handler tries
to access data that has been deleted.

The fix here is to ensure that the main thread does not exit until the
signal handler is finished by synchronizing on a mutex.

Signed-off-by: Addisu Z. Taddese <[email protected]>
@azeey azeey requested a review from mjcarroll as a code owner August 26, 2024 16:46
@github-actions github-actions bot added the 🏛️ ionic Gazebo Ionic label Aug 26, 2024
@iche033
Copy link
Contributor

iche033 commented Aug 26, 2024

CI caught some new test failures. I was able to reproduce the INTEGRATION_dvl* test failures locally

@azeey azeey added the beta Targeting beta release of upcoming collection label Aug 26, 2024
azeey added 2 commits August 27, 2024 12:27
It's incorrect to decide whether we call SimulationRunner::Stop on the
`running` variable since that could be set to false when after a fixed
number of iterations has been executed. i.e., the running variable does
not really indicate that we're in shutdown mode.

Signed-off-by: Addisu Z. Taddese <[email protected]>
@azeey
Copy link
Contributor Author

azeey commented Aug 27, 2024

CI caught some new test failures. I was able to reproduce the INTEGRATION_dvl* test failures locally

Apparently I was not running tests that need a display because I didn't have glxinfo available on the container I was using. I have pushed a change that fixed INTEGRATION_dvl* for me. I did have some other flaky tests locally, but they seem to be flaky on the main branch as well.

@azeey azeey merged commit 8b1dbda into gazebosim:main Aug 28, 2024
9 checks passed
@azeey azeey deleted the fix_signal_handler_crash branch August 28, 2024 18:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
beta Targeting beta release of upcoming collection 🏛️ ionic Gazebo Ionic
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

2 participants