This repository was archived by the owner on Nov 17, 2023. It is now read-only.

Deadlock happened while calling MXNDArraySyncCopyToCPU()? #12923

Closed
@coconutyao

Description


We have been struggling with this problem for a few days and would appreciate any help. Thank you!

Environment
GPU: Tesla P4; CPU: Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz.

Appearance
The program runs as a server that receives image data. After running for some time, it hangs in what looks like a deadlock. Certain requests may trigger it, but we cannot reproduce it reliably.

We tested MXNet versions 1.0, 1.2, and 1.3, and the program showed the same behavior on all of them.

Program running process
Our multithreaded C++ program embeds the Python interpreter, and the Python code uses the MXNet Python API. As the stack trace shows, MXNDArraySyncCopyToCPU() waits on a condition variable during execution, and the program stays stuck there indefinitely.
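
Because the hang cannot be reproduced on demand, it may help to detect it at the Python level when it happens. The following is a minimal sketch, assuming Python 3 (the traces in this issue come from Python 2.7) and using only the standard library. The helper names `format_all_thread_stacks` and `call_with_watchdog` are hypothetical and are not part of MXNet. The sketch runs a blocking call on a separate thread and, if the call does not return within a timeout, collects the Python-level stack of every thread.

```python
import sys
import threading
import traceback


def format_all_thread_stacks():
    """Return the Python-level stack of every live thread as one string."""
    names = {t.ident: t.name for t in threading.enumerate()}
    parts = []
    for ident, frame in sys._current_frames().items():
        parts.append("Thread %s (%s):\n" % (names.get(ident, "?"), ident))
        parts.extend(traceback.format_stack(frame))
    return "".join(parts)


def call_with_watchdog(fn, timeout_s):
    """Run fn() on a daemon thread and wait at most timeout_s seconds.

    Returns (True, value) if fn finished in time, otherwise
    (False, stack_dump). A stuck call is left running, because
    Python has no way to cancel a thread blocked in native code.
    """
    box = {}

    def runner():
        try:
            box["value"] = fn()
        except BaseException as exc:  # hand the error back to the caller
            box["error"] = exc

    worker = threading.Thread(target=runner, name="watched-call", daemon=True)
    worker.start()
    worker.join(timeout_s)
    if worker.is_alive():
        return False, format_all_thread_stacks()
    if "error" in box:
        raise box["error"]
    return True, box["value"]
```

A call such as `call_with_watchdog(lambda: output.asnumpy(), 30.0)`, where `output` stands for some NDArray, would cover the synchronous copy, since `asnumpy()` is the Python entry point that reaches MXNDArraySyncCopyToCPU(). The dump only shows Python frames, so it complements a gdb trace rather than replacing it.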

Stack information
Thread 85 (Thread 0x7f3cba52f700 (LWP 41394)):
#0 0x00007f3d582fd6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007f3d580979bc in __gthread_cond_wait (__mutex=, __cond=) at /data/home/xxx/gcc-build/gcc-4.9.4/build/x86_64-redhat-linux/libstdc++-v3/include/x86_64-redhat-linux/bits/gthr-default.h:864
#2 std::condition_variable::wait (this=, __lock=...) at ../../../../../libstdc++-v3/src/c++11/condition_variable.cc:52
#3 0x00007f3c7bcb86d5 in ?? () from my_app/anaconda2/lib/python2.7/site-packages/mxnet/libmxnet.so
#4 0x00007f3c7bd94b4d in ?? () from my_app/anaconda2/lib/python2.7/site-packages/mxnet/libmxnet.so
#5 0x00007f3c7be7e9c3 in ?? () from my_app/anaconda2/lib/python2.7/site-packages/mxnet/libmxnet.so
#6 0x00007f3c7bc516db in MXNDArraySyncCopyToCPU () from my_app/anaconda2/lib/python2.7/site-packages/mxnet/libmxnet.so
#7 0x00007f3d53e15adc in ffi_call_unix64 () from my_app/libs/./libffi.so.6
#8 0x00007f3d53e15282 in ffi_call () from my_app/libs/./libffi.so.6
#9 0x00007f3bfdd09376 in _call_function_pointer (argcount=3, resmem=0x7f3b3c1c4040, restype=, atypes=, avalues=0x7f3b3c1c4010, pProc=0x7f3c7bc516b0 , flags=4353) at /home/xxx/minonda/conda-bld/python-2.7_1482296880985/work/Python-2.7.13/Modules/_ctypes/callproc.c:841
#10 _ctypes_callproc (pProc=0x7f3c7bc516b0 , argtuple=0x7f3b3c1c4130, flags=4353, argtypes=, restype=0x1616b80, checker=0x0) at /home/xxx/minonda/conda-bld/python-2.7_1482296880985/work/Python-2.7.13/Modules/_ctypes/callproc.c:1184
#11 0x00007f3bfdd00db3 in PyCFuncPtr_call (self=, inargs=, kwds=0x0) at /home/xxx/minonda/conda-bld/python-2.7_1482296880985/work/Python-2.7.13/Modules/_ctypes/_ctypes.c:3979
#12 0x00007f3d52c42e93 in PyObject_Call (func=0x7f3d2a11a050, arg=, kw=) at Objects/abstract.c:2547
#13 0x00007f3d52cf580d in do_call (nk=, na=, pp_stack=0x7f3b3c1c43b8, func=0x7f3d2a11a050) at Python/ceval.c:4569
#14 call_function (oparg=, pp_stack=0x7f3b3c1c43b8) at Python/ceval.c:4374
#15 PyEval_EvalFrameEx (f=, throwflag=) at Python/ceval.c:2989
#16 0x00007f3d52cf7c3e in PyEval_EvalCodeEx (co=0x7f3d3f730030, globals=, locals=, args=, argcount=1, kws=0x7f3d2a186fd0, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:3584
#17 0x00007f3d52cf71f7 in fast_function (nk=, na=1, n=, pp_stack=0x7f3b3c1c45d8, func=0x7f3d3f6ee5f0) at Python/ceval.c:4447
#18 call_function (oparg=, pp_stack=0x7f3b3c1c45d8) at Python/ceval.c:4372
#19 PyEval_EvalFrameEx (f=, throwflag=) at Python/ceval.c:2989
#20 0x00007f3d52cf7345 in fast_function (nk=, na=, n=, pp_stack=0x7f3b3c1c4748, func=0x7f3d2aea9c80) at Python/ceval.c:4437
#21 call_function (oparg=, pp_stack=0x7f3b3c1c4748) at Python/ceval.c:4372
#22 PyEval_EvalFrameEx (f=, throwflag=) at Python/ceval.c:2989
#23 0x00007f3d52cf7c3e in PyEval_EvalCodeEx (co=0x7f3d528fcc30, globals=, locals=, args=, argcount=2, kws=0x7f3d2a18dc68, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:3584
#24 0x00007f3d52cf71f7 in fast_function (nk=, na=2, n=, pp_stack=0x7f3b3c1c4968, func=0x7f3d2a33f0c8) at Python/ceval.c:4447
#25 call_function (oparg=, pp_stack=0x7f3b3c1c4968) at Python/ceval.c:4372
#26 PyEval_EvalFrameEx (f=, throwflag=) at Python/ceval.c:2989
#27 0x00007f3d52cf7345 in fast_function (nk=, na=, n=, pp_stack=0x7f3b3c1c4ad8, func=0x7f3d2a33f410) at Python/ceval.c:4437
#28 call_function (oparg=, pp_stack=0x7f3b3c1c4ad8) at Python/ceval.c:4372
#29 PyEval_EvalFrameEx (f=, throwflag=) at Python/ceval.c:2989
#30 0x00007f3d52cf7c3e in PyEval_EvalCodeEx (co=0x7f3d52963db0, globals=, locals=, args=, argcount=1, kws=0x0, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:3584
#31 0x00007f3d52c72a61 in function_call (func=0x7f3d2a33f8c0, arg=0x7f3d529377d0, kw=0x0) at Objects/funcobject.c:523
#32 0x00007f3d52c42e93 in PyObject_Call (func=0x7f3d2a33f8c0, arg=, kw=) at Objects/abstract.c:2547
#33 0x00007f3d52ced7b3 in PyEval_CallObjectWithKeywords (func=0x7f3d2a33f8c0, arg=0x7f3d529377d0, kw=) at Python/ceval.c:4221
#34 0x00007f3d52d13468 in PyEval_CallMethod (obj=, methodname=, format=) at Python/modsupport.c:612
#35 0x00007f3d5303141f in ?? ()
#36 0x0000000000000000 in ?? ()


In addition:
Other threads are sometimes blocked at the same time. The stack below belongs to an unrelated CPU-only thread. Strangely, it also contains frames from libmxnet.so:

Thread 70 (Thread 0x7f3b0bff6700 (LWP 41409)):
#0 0x00007f3d582fd6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007f3d580979bc in __gthread_cond_wait (__mutex=, __cond=) at /data/home/xxx/gcc-build/gcc-4.9.4/build/x86_64-redhat-linux/libstdc++-v3/include/x86_64-redhat-linux/bits/gthr-default.h:864
#2 std::condition_variable::wait (this=, __lock=...) at ../../../../../libstdc++-v3/src/c++11/condition_variable.cc:52
#3 0x00007f3c7bcb88a3 in ?? () from my_app/anaconda2/lib/python2.7/site-packages/mxnet/libmxnet.so
#4 0x00007f3c7bcc0339 in ?? () from my_app/anaconda2/lib/python2.7/site-packages/mxnet/libmxnet.so
#5 0x00007f3d577c4702 in fork () from /lib64/libc.so.6
......
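
The Thread 70 trace shows libmxnet.so frames directly above fork(). The issue does not establish why, so the following is only background on a general hazard, not a diagnosis: on POSIX systems, fork() in a multithreaded process copies only the calling thread, so a child that waits for a signal from another thread waits forever. A minimal sketch in plain Python (POSIX only; the function name `child_sees_worker_signal` is hypothetical, and no MXNet code is involved):

```python
import os
import threading


def child_sees_worker_signal(timeout_s=0.5):
    """Fork while a worker thread is alive and report whether the child
    ever observes the event that the worker sets after the fork."""
    go = threading.Event()      # parent releases the worker after forking
    ready = threading.Event()   # set by the worker thread

    def worker():
        go.wait()
        ready.set()

    t = threading.Thread(target=worker)
    t.start()

    pid = os.fork()
    if pid == 0:
        # Child process: the worker thread was not copied, so nothing
        # here will ever set `ready`. A wait without a timeout would hang.
        seen = ready.wait(timeout_s)
        os._exit(0 if seen else 1)

    # Parent process: let the worker finish, then collect the child's verdict.
    go.set()
    t.join()
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) == 0
```

In the parent the event is set as expected, while the child times out. A native library that keeps worker threads and condition variables could be affected by the same mechanism if a process forks while the library is active, which may be worth checking in the server code.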
