-
Notifications
You must be signed in to change notification settings - Fork 864
Avoid deadlock between transport and transaction #4453
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR addresses a potential deadlock issue by modifying the transport callback flow in the SIP transaction layer. Key changes include:
- Introducing a new timer-based handling mechanism (misc_timer) to delay processing when the transaction lock cannot be acquired immediately.
- Refactoring the transport callback into a separate implementation function (transport_callback_impl) and adjusting group lock usage.
- Extending the transaction structure to include a miscellaneous timer for managing delayed callbacks.
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
File | Description |
---|---|
pjsip/src/pjsip/sip_transaction.c | Refactored transport_callback to avoid deadlock by delaying callback processing via timer. |
pjsip/include/pjsip/sip_transaction.h | Added a new timer entry (misc_timer) to pjsip_transaction structure. |
Comments suppressed due to low confidence (1)
pjsip/src/pjsip/sip_transaction.c:2278
- The transport_callback_impl function acquires the group lock even though the caller already holds it (using tryacquire), which can lead to double-locking issues if the lock is not reentrant. Consider refactoring the locking so that transport_callback_impl assumes the lock is already held or remove the tryacquire in transport_callback.
pj_grp_lock_acquire(tsx->grp_lock);
@nanangizz Thanks a lot for your proposed fix... Much appreciate your prompt reaction. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! This is a very difficult issue that has been troubling us for quite some time.
@dotcommmm Yes, this PR needs to be reviewed before it can be merged. We can't guarantee anything regarding the release timeline, but it's typically before the end of the year. Btw, we'd appreciate it if you could somehow test this PR or participate in the review :) |
@nanangizz I have some instances where the issue occurs randomly. Question is more: How can I test/compile this fix into my Asterisk? (when a new release is not yet available) |
@dotcommmm Save the patch (https://github.com/pjsip/pjproject/commit/2a365a139d0c4459c62d9640937a3cdf70928f30.patch) with following name 0020-Avoid_deadlock_between_transport_and_transaction.patch and copy into asterisk_source_code_dir/third-party/pjproject/patches/ |
@silentindark Cool.. thanks, will have a try later on :)! @nanangizz we just experienced a new deadlock (net yet applied the patch). I'm adding the output from "thread apply all bt" here. If not, I will create a fresh issue. PS: Output of "core show locks" in my Asterisk system: |
From a quick look, it is perhaps the same issue, thread 28 transaction's |
@silentindark After placing the file here: ... and firing the following commands: ... Is there any way for me to check/verify if the patch is applied? (Sorry, first time I'm doing this.) |
@dotcommmm When you run command ./configure you will see output similar to this So try |
@silentindark After using "make distclean", I indeed see "Applying patches ..." in my console. I applied the patch to 3 instances in the meanwhile. Should the issue occur again, I'll update this pull for sure :) |
To fix #4442.
Call stack trace:
Possible deadlock scenario:
a. receive SUBSCRIBE request for subscription refresh
b. new SUBSCRIBE tsx is created and its lock is aquired
c. send response asynchronously
d. context switch happens before sending NOTIFY.
a. SUBSCRIBE response sent callback is invoked, note that transport/ioqueue-key lock is being acquired
b. wait for SUBSCRIBE tsx lock being held by thread 1.
The idea here is by modifying the flow in thread 2, i.e: instead of waiting for SUBSCRIBE tsx lock, it delays the processing (via timer) whenever the tsx lock cannot be acquired immediately.