Skip to content

Avoid deadlock between transport and transaction #4453

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

nanangizz
Copy link
Member

To fix #4442.

Call stack trace:

Thread 43 (Thread 0x7f361bc886c0 (LWP 5040)):
#4  0x00007f36a2180fb5 in pj_mutex_lock (mutex=0x7f366916c708) at ../src/pj/os_core_unix.c:1447
#5  0x00007f36a218ac72 in pj_lock_acquire (lock=0x7f366916c6e0) at ../src/pj/lock.c:179
#6  0x00007f36a218afe9 in grp_lock_acquire (p=0x7f366916c5f8) at ../src/pj/lock.c:297
#7  0x00007f36a218b5e4 in pj_grp_lock_acquire (grp_lock=0x7f366916c5f8) at ../src/pj/lock.c:486
#8  0x00007f36a20daf5e in transport_callback (token=0x7f36611f2c08, tdata=0x7f366a268ee8, sent=728) at ../src/pjsip/sip_transaction.c:2284
#9  0x00007f36a20c01f1 in transport_send_callback (transport=0x7f366d27e368, token=0x7f366a268ee8, size=728) at ../src/pjsip/sip_transport.c:903
#10 0x00007f36a20c9d14 in on_data_sent (asock=0x7f366d147bf8, op_key=0x7f366a268f40, bytes_sent=728) at ../src/pjsip/sip_transport_tcp.c:1236
#11 0x00007f36a218628c in ioqueue_on_write_complete (key=0x55c596f14048, op_key=0x7f366a268f40, bytes_sent=728) at ../src/pj/activesock.c:811
#12 0x00007f36a217bc61 in ioqueue_dispatch_write_event (ioqueue=0x7f3650317130, h=0x55c596f14048) at ../src/pj/ioqueue_common_abs.c:409
#13 0x00007f36a217ed3b in pj_ioqueue_poll (ioqueue=0x7f3650317130, timeout=0x7f361bc87d90) at ../src/pj/ioqueue_epoll.c:1005


Thread 56 (Thread 0x7f36190646c0 (LWP 26743)):
#4  0x00007f36a2180fb5 in pj_mutex_lock (mutex=0x7f366c904eb8) at ../src/pj/os_core_unix.c:1447
#5  0x00007f36a218ac72 in pj_lock_acquire (lock=0x7f366c904e90) at ../src/pj/lock.c:179
#6  0x00007f36a218afe9 in grp_lock_acquire (p=0x7f366c904da8) at ../src/pj/lock.c:297
#7  0x00007f36a218b5e4 in pj_grp_lock_acquire (grp_lock=0x7f366c904da8) at ../src/pj/lock.c:486
#8  0x00007f36a217d380 in pj_ioqueue_lock_key (key=0x55c596f14048) at ../src/pj/ioqueue_common_abs.c:1394
#9  0x00007f36a217c810 in pj_ioqueue_send (key=0x55c596f14048, op_key=0x7f366a396280, data=0x7f3668716d98, length=0x7f361905ff30, flags=0) at ../src/pj/ioqueue_common_abs.c:951
#10 0x00007f36a2185fb1 in pj_activesock_send (asock=0x7f366d147bf8, send_key=0x7f366a396280, data=0x7f3668716d98, size=0x7f361905ff30, flags=0) at ../src/pj/activesock.c:718
#11 0x00007f36a20ca069 in tcp_send_msg (transport=0x7f366d27e368, tdata=0x7f366a396228, rem_addr=0x7f366a396418, addr_len=16, token=0x7f366a396228, callback=0x7f36a20c0155 <transport_send_callback>) at ../src/pjsip/sip_transport_tcp.c:1341
#12 0x00007f36a20c048e in pjsip_transport_send (tr=0x7f366d27e368, tdata=0x7f366a396228, addr=0x7f366a396418, addr_len=16, token=0x7f3668726db8, cb=0x7f36a20ba856 <stateless_send_transport_cb>) at ../src/pjsip/sip_transport.c:979
#13 0x00007f36a20bae66 in stateless_send_transport_cb (token=0x7f3668726db8, tdata=0x7f366a396228, sent=-70002) at ../src/pjsip/sip_util.c:1304
#14 0x00007f36a20bb372 in stateless_send_resolver_callback (status=0, token=0x7f3668726db8, addr=0x7f3619060110) at ../src/pjsip/sip_util.c:1439
#15 0x00007f3653f08d27 in sip_resolve (resolver=0x55c597267658, pool=0x7f366a396180, target=0x7f3619063550, token=0x7f3668726db8, cb=0x7f36a20bafb7 <stateless_send_resolver_callback>) at res_pjsip/pjsip_resolver.c:525
#16 0x00007f36a20bd336 in pjsip_resolve (resolver=0x55c597267658, pool=0x7f366a396180, target=0x7f3619063550, token=0x7f3668726db8, cb=0x7f36a20bafb7 <stateless_send_resolver_callback>) at ../src/pjsip/sip_resolve.c:222
#17 0x00007f36a20b7a2f in pjsip_endpt_resolve (endpt=0x55c59767a928, pool=0x7f366a396180, target=0x7f3619063550, token=0x7f3668726db8, cb=0x7f36a20bafb7 <stateless_send_resolver_callback>) at ../src/pjsip/sip_endpoint.c:1205
#18 0x00007f36a20bb5a2 in pjsip_endpt_send_request_stateless (endpt=0x55c59767a928, tdata=0x7f366a396228, token=0x7f36602430b8, cb=0x7f36a20da7f0 <send_msg_callback>) at ../src/pjsip/sip_util.c:1508
#19 0x00007f36a20db761 in tsx_send_msg (tsx=0x7f36602430b8, tdata=0x7f366a396228) at ../src/pjsip/sip_transaction.c:2518
#20 0x00007f36a20dc22e in tsx_on_state_null (tsx=0x7f36602430b8, event=0x7f3619063670) at ../src/pjsip/sip_transaction.c:2788
#21 0x00007f36a20da6db in pjsip_tsx_send_msg (tsx=0x7f36602430b8, tdata=0x7f366a396228) at ../src/pjsip/sip_transaction.c:2020
#22 0x00007f36a20e1c08 in pjsip_dlg_send_request (dlg=0x7f363c632bc8, tdata=0x7f366a396228, mod_data_id=-1, mod_data=0x0) at ../src/pjsip/sip_dialog.c:1400
#23 0x00007f36a2097094 in pjsip_evsub_send_request (sub=0x7f363c54da48, tdata=0x7f366a396228) at ../src/pjsip-simple/evsub.c:1425
#24 0x00007f36a2098a32 in on_tsx_state_uas (sub=0x7f363c54da48, tsx=0x7f36611f2c08, event=0x7f3619063960) at ../src/pjsip-simple/evsub.c:2240
#25 0x00007f36a2098da9 in mod_evsub_on_tsx_state (tsx=0x7f36611f2c08, event=0x7f3619063960) at ../src/pjsip-simple/evsub.c:2346
#26 0x00007f36a20e397e in pjsip_dlg_on_tsx_state (dlg=0x7f363c632bc8, tsx=0x7f36611f2c08, e=0x7f3619063960) at ../src/pjsip/sip_dialog.c:2252
#27 0x00007f36a20e4471 in mod_ua_on_tsx_state (tsx=0x7f36611f2c08, e=0x7f3619063960) at ../src/pjsip/sip_ua_layer.c:186
#28 0x00007f36a20d9209 in tsx_set_state (tsx=0x7f36611f2c08, state=PJSIP_TSX_STATE_TRYING, event_src_type=PJSIP_EVENT_RX_MSG, event_src=0x7f366d7a9d38, flag=0) at ../src/pjsip/sip_transaction.c:1457
#29 0x00007f36a20dc13f in tsx_on_state_null (tsx=0x7f36611f2c08, event=0x7f3619063a10) at ../src/pjsip/sip_transaction.c:2761
#30 0x00007f36a20da7d3 in pjsip_tsx_recv_msg (tsx=0x7f36611f2c08, rdata=0x7f366d7a9d38) at ../src/pjsip/sip_transaction.c:2058
#31 0x00007f36a20e2f2b in pjsip_dlg_on_rx_request (dlg=0x7f363c632bc8, rdata=0x7f366d7a9d38) at ../src/pjsip/sip_dialog.c:1882
#32 0x00007f36a20e54ac in mod_ua_on_rx_request (rdata=0x7f366d7a9d38) at ../src/pjsip/sip_ua_layer.c:745
#33 0x00007f36a20b73e4 in pjsip_endpt_process_rx_data (endpt=0x55c59767a928, rdata=0x7f366d7a9d38, p=0x7f3653f35ce0 <param>, p_handled=0x7f3619063c34) at ../src/pjsip/sip_endpoint.c:934

Possible deadlock scenario:

  1. Thread 1:
    a. receive SUBSCRIBE request for subscription refresh
    b. new SUBSCRIBE tsx is created and its lock is aquired
    c. send response asynchronously
    d. context switch happens before sending NOTIFY.
  2. Thread 2:
    a. SUBSCRIBE response sent callback is invoked, note that transport/ioqueue-key lock is being acquired
    b. wait for SUBSCRIBE tsx lock being held by thread 1.
  3. Thread 1: trying to continue sending NOTIFY, but cannot acquire the transport lock.

The idea here is by modifying the flow in thread 2, i.e: instead of waiting for SUBSCRIBE tsx lock, it delays the processing (via timer) whenever the tsx lock cannot be acquired immediately.

Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR addresses a potential deadlock issue by modifying the transport callback flow in the SIP transaction layer. Key changes include:

  • Introducing a new timer-based handling mechanism (misc_timer) to delay processing when the transaction lock cannot be acquired immediately.
  • Refactoring the transport callback into a separate implementation function (transport_callback_impl) and adjusting group lock usage.
  • Extending the transaction structure to include a miscellaneous timer for managing delayed callbacks.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
pjsip/src/pjsip/sip_transaction.c Refactored transport_callback to avoid deadlock by delaying callback processing via timer.
pjsip/include/pjsip/sip_transaction.h Added a new timer entry (misc_timer) to pjsip_transaction structure.
Comments suppressed due to low confidence (1)

pjsip/src/pjsip/sip_transaction.c:2278

  • The transport_callback_impl function acquires the group lock even though the caller already holds it (using tryacquire), which can lead to double-locking issues if the lock is not reentrant. Consider refactoring the locking so that transport_callback_impl assumes the lock is already held or remove the tryacquire in transport_callback.
pj_grp_lock_acquire(tsx->grp_lock);

@dotcommmm
Copy link

@nanangizz Thanks a lot for your proposed fix... Much appreciate your prompt reaction.
Just wondering (no pressure), if this is reviewed by the other members, when do you expect to release version 2.16? :)

Copy link
Member

@sauwming sauwming left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! This is a very difficult issue that has been troubling us for quite some time.

@nanangizz
Copy link
Member Author

@nanangizz Thanks a lot for your proposed fix... Much appreciate your prompt reaction. Just wondering (no pressure), if this is reviewed by the other members, when do you expect to release version 2.16? :)

@dotcommmm Yes, this PR needs to be reviewed before it can be merged. We can't guarantee anything regarding the release timeline, but it's typically before the end of the year. Btw, we'd appreciate it if you could somehow test this PR or participate in the review :)

@dotcommmm
Copy link

@nanangizz I have some instances where the issue occurs randomly. Question is more: How can I test/compile this fix into my Asterisk? (when a new release is not yet available)

@silentindark
Copy link
Contributor

@dotcommmm Save the patch (https://github.com/pjsip/pjproject/commit/2a365a139d0c4459c62d9640937a3cdf70928f30.patch) with following name 0020-Avoid_deadlock_between_transport_and_transaction.patch and copy into asterisk_source_code_dir/third-party/pjproject/patches/
And then recompile asterisk

@dotcommmm
Copy link

dotcommmm commented Jun 5, 2025

@silentindark Cool.. thanks, will have a try later on :)!

@nanangizz we just experienced a new deadlock (net yet applied the patch). I'm adding the output from "thread apply all bt" here.
Not sure if you can verify if here was the same issue?
gdb_05062025.log

If not, I will create a fresh issue.

PS: Output of "core show locks" in my Asterisk system:
locks_output_05062025.txt

@nanangizz
Copy link
Member Author

@nanangizz we just experienced a new deadlock (net yet applied the patch). I'm adding the output from "thread apply all bt" here. Not sure if you can verify if here was the same issue? gdb_05062025.log

From a quick look, it is perhaps the same issue, thread 28 transaction's transport_callback() waiting for mutex 0x7fd12582b058, while other threads waiting for transport mutex 0x7fd11c0862c8.

@dotcommmm
Copy link

@silentindark After placing the file here:
/usr/src/asterisk-22.3.0/third-party/pjproject/patches/0020-Avoid_deadlock_between_transport_and_transaction.patch

... and firing the following commands:
cd /usr/src/asterisk-22.3.0/
make clean
./configure
make
make install
service asterisk restart

... Is there any way for me to check/verify if the patch is applied?

(Sorry, first time I'm doing this.)

@silentindark
Copy link
Contributor

silentindark commented Jun 7, 2025

@dotcommmm When you run command ./configure you will see output similar to this
...
[pjproject] Applying patches /usr/src/asterisk-22.3.0/third-party/pjproject/patches /usr/src/asterisk-22.3.0/third-party/pjproject/source
[pjproject] Applying user.mak
[pjproject] Rebuilding
[pjproject] Applying custom include file patches/asterisk_malloc_debug.h
[pjproject] Applying custom include file patches/config_site.h
...
I think that better to use "make distclean" instead of "make clean"

So try
cd /usr/src/asterisk-22.3.0/
make distclean
./configure --with-pjproject-bundled
make
make install
service asterisk restart

@dotcommmm
Copy link

@silentindark After using "make distclean", I indeed see "Applying patches ..." in my console.
Again your help is much appreciated, thanks!

I applied the patch to 3 instances in the meanwhile. Should the issue occur again, I'll update this pull for sure :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Deadlock] PJSIP TCP transport deadlock in pj_ioqueue_send() on PJPROJECT 2.15.1 (used with Asterisk)
5 participants