Terrible Deadlock issue #937

Closed
wingerted opened this issue Aug 1, 2016 · 20 comments

@wingerted

Hello, we found a terrible deadlock in OpenBLAS. Trace info:

81 in ../sysdeps/unix/syscall-template.S
(gdb) bt
#0  0x00007f79e0c5e2a7 in sched_yield () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f79de5aa735 in inner_thread () from /opt/OpenBLAS/lib/libopenblas.so.0
#2  0x00007f79de6b5369 in exec_blas () from /opt/OpenBLAS/lib/libopenblas.so.0
#3  0x00007f79de5aadde in gemm_driver.constprop.0 () from /opt/OpenBLAS/lib/libopenblas.so.0
#4  0x00007f79de5aae55 in dgemm_thread_tt () from /opt/OpenBLAS/lib/libopenblas.so.0
#5  0x00007f79de4c761e in cblas_dgemm () from /opt/OpenBLAS/lib/libopenblas.so.0
#6  0x00007f79df40bbf0 in gemm (typenum=typenum@entry=12, transA=<optimized out>, transB=<optimized out>, m=<optimized out>, n=<optimized out>, k=<optimized out>, A=A@entry=0x7f790dc0e850,
    lda=1688, B=B@entry=0x7f790dc0e940, ldb=1, R=R@entry=0x7f78f5818530, order=CblasRowMajor) at numpy/core/src/multiarray/cblasfuncs.c:61
#7  0x00007f79df40c506 in cblas_matrixproduct (typenum=typenum@entry=12, ap1=0x7f790dc0e850, ap1@entry=0x7f790dc0ecb0, ap2=0x7f790dc0e940, ap2@entry=0x7f790dc0e2b0, out=out@entry=0x7f78f5818530)
    at numpy/core/src/multiarray/cblasfuncs.c:650
#8  0x00007f79df3e16b4 in PyArray_MatrixProduct2 (op1=<optimized out>, op2=<numpy.ndarray at remote 0x7f790dc0e2b0>, out=0x7f78f5818530) at numpy/core/src/multiarray/multiarraymodule.c:989
#9  0x00007f79df3e479c in array_matrixproduct (__NPY_UNUSED_TAGGEDdummy=<optimized out>,
    args=(<numpy.ndarray at remote 0x7f790dc0ecb0>, <numpy.ndarray at remote 0x7f790dc0e2b0>, <numpy.ndarray at remote 0x7f78f5818530>), kwds=0x0)
    at numpy/core/src/multiarray/multiarraymodule.c:2237

We just ran the job and found it blocked on sched_yield for 80 hours.
We can't reproduce the problem every time, but it always happens when I run enough jobs.
Our CPU is an Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz.

Can anyone help?

@martin-frbg
Collaborator

Which version of OpenBLAS and what operating system? Did you build with OpenMP support?

@martin-frbg
Collaborator

[Linked the OpenBLAS FAQ entry on multi-threaded usage.]

@wingerted
Author

wingerted commented Aug 1, 2016

@martin-frbg Thank you for the reply.
We use version 0.2.15 on Ubuntu 14.04.4 LTS, and we haven't built with OpenMP since our job is a single-threaded application.
We also set OMP_NUM_THREADS=2 to make OpenBLAS use threads when running the job.

I see the FAQ. Does that mean that if I have a multi-threaded application, I can only use OMP_NUM_THREADS=1?
And if I have a single-threaded application and want to use OMP_NUM_THREADS=2, must I compile OpenBLAS with USE_OPENMP=1?
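
As an aside (this sketch is not from the thread): with the pthread build of OpenBLAS, the thread count can be pinned from a Python job by setting environment variables before numpy is imported. OPENBLAS_NUM_THREADS is the OpenBLAS-specific variable and takes precedence; OMP_NUM_THREADS is used as a fallback and also governs an OpenMP build. The matrix sizes below are arbitrary.

    import os

    # Thread-count variables must be set before numpy (and hence OpenBLAS)
    # is loaded; changing them afterwards has no effect.
    os.environ.setdefault("OPENBLAS_NUM_THREADS", "1")  # pthread build
    os.environ.setdefault("OMP_NUM_THREADS", "1")       # OpenMP build / fallback

    import numpy as np

    a = np.random.rand(2000, 1688)
    b = np.random.rand(1688, 2000)
    c = a.dot(b)  # now runs on a single OpenBLAS thread
    print(c.shape)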

@martin-frbg
Collaborator

From my limited understanding (note I am a user, not a developer), to avoid deadlocks you would need to build with OpenMP if both your application and the OpenBLAS it calls will be creating threads (I am not that familiar with numpy, but doesn't it run multithreaded by default?).
Performance may even be better if OpenBLAS is not allowed to create its own sub-threads when it is called from a thread itself, depending on hardware and workload.
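
For context, a rough sketch (not from the thread) of the combination being warned about: the application spawns its own threads and each one calls into a multi-threaded, non-OpenMP OpenBLAS through numpy, so both layers are creating threads at once.

    import threading
    import numpy as np

    def worker():
        # Each application thread calls into BLAS; a multi-threaded OpenBLAS
        # additionally spawns its own worker threads inside this call.
        a = np.random.rand(1000, 1000)
        b = np.random.rand(1000, 1000)
        a.dot(b)

    threads = [threading.Thread(target=worker) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()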

@brada4
Contributor

brada4 commented Aug 2, 2016

sched_yield is called only by the multi-threaded OpenBLAS. Please build according to your needs (a single-threaded version), and with ldb==1 you should call gemv, not gemm.
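
To illustrate the gemv-versus-gemm point (this example is not from the thread and assumes SciPy's low-level BLAS wrappers are available): when the right-hand operand is effectively a single column, as ldb=1 in the backtrace suggests, the same result can be obtained from dgemv instead of dgemm.

    import numpy as np
    from scipy.linalg import blas  # thin wrappers over the underlying BLAS (OpenBLAS here)

    A = np.random.rand(512, 1688)
    x = np.random.rand(1688)

    # Matrix-vector product via dgemv, the call suggested for this shape.
    y_gemv = blas.dgemv(1.0, A, x)

    # The same product funneled through dgemm by treating x as an (n, 1) matrix,
    # roughly what the gemm call with ldb=1 in the backtrace amounts to.
    y_gemm = blas.dgemm(1.0, A, x.reshape(-1, 1))

    assert np.allclose(y_gemv, y_gemm.ravel())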

@wingerted
Author

wingerted commented Aug 2, 2016

We just built numpy with OpenBLAS, and gemm is what numpy calls... is that wrong?

And we do want to use the multi-threaded OpenBLAS to speed up our application.

@martin-frbg
Collaborator

Update to at least 0.2.16 to get various speed improvements as well as a fix for a thread-safety issue seen in conjunction with numpy, cf. #716 (what caused crashes there may have been the cause of other undesirable behaviour as well). Did you actually see a speedup with multithreaded OpenBLAS (in the cases where it did not hang), or are you just expecting it to perform better than a single-threaded version of the library?
(The gemv vs. gemm point mentioned by brada4 looks like a missed optimization opportunity in numpy.)
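
One way (not mentioned in the thread) to confirm which OpenBLAS build a numpy installation actually loads is to query the library's own openblas_get_config() through ctypes; this sketch is Linux-only and assumes the BLAS that numpy links against really is OpenBLAS.

    import ctypes
    import numpy as np  # importing numpy loads its BLAS into the process

    # Find the OpenBLAS shared object(s) mapped into this process.
    with open("/proc/self/maps") as maps:
        paths = {line.split()[-1] for line in maps if "openblas" in line.lower()}

    for path in paths:
        lib = ctypes.CDLL(path)
        lib.openblas_get_config.restype = ctypes.c_char_p
        print(path)
        print(lib.openblas_get_config())  # version string, core name, MAX_THREADS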

@wingerted
Author

I really do see a speedup with the multi-threaded version; it works very well when it does not hang. Maybe I should try the latest version as you say.

@jianqiangyao

Will using the latest version fix the hang? I ran into this problem as well.

@wingerted
Author

No, we finally set a timeout on our job and restart it to avoid this.
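
A rough sketch of that watchdog-style workaround (not from the thread; job.py and the timeout value are placeholders): run the job as a subprocess, kill it if it exceeds a deadline, and start it again.

    import subprocess
    import sys
    import time

    JOB = [sys.executable, "job.py"]   # hypothetical job script
    TIMEOUT_SECONDS = 3600             # upper bound on a healthy run

    while True:
        proc = subprocess.Popen(JOB)
        deadline = time.time() + TIMEOUT_SECONDS
        while proc.poll() is None and time.time() < deadline:
            time.sleep(10)
        if proc.poll() is not None:
            break                      # job finished on its own
        proc.kill()                    # presumed deadlocked; kill and retry
        proc.wait()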

@martin-frbg
Collaborator

Following #1071, you could try whether building OpenBLAS with USE_SIMPLE_THREADED_LEVEL3=1 helps.

@wingerted
Author

What does USE_SIMPLE_THREADED_LEVEL3=1 mean? Will it weaken the performance?

@brada4
Contributor

brada4 commented Feb 12, 2017

First, it should stop the crashing.
Not taking the NUMA node hierarchy into account typically reduces performance by 8-10%, and only on NUMA systems with huge inputs (an order of magnitude bigger than the NUMA stride of 4..256 MB) and only for potrf, gemm, syrk and symm; it has no effect whatsoever on non-NUMA systems (you have a NUMA system if it has, say, two 8-core CPUs).
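
As an aside (not from the thread), a quick Linux-only check of whether the NUMA caveat applies to a given machine; it simply counts the NUMA nodes the kernel exposes under /sys.

    import glob
    import re

    # More than one node means the performance caveat above may apply;
    # a single node means USE_SIMPLE_THREADED_LEVEL3=1 should cost nothing.
    nodes = [d for d in glob.glob("/sys/devices/system/node/node*")
             if re.search(r"node\d+$", d)]
    print("NUMA nodes:", max(len(nodes), 1))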

@martin-frbg
Collaborator

I cannot promise that it will solve the problem; it is just another option to try to work around what appear to be thread contention issues. Both implementations of the BLAS threading date back to Kazushige Goto's libGoto2 from about 2009. (I checked the sparse documentation that came with libGoto2-1.08 but found no explanation of either method; libGoto-1.0 from 2006 only had what is now labeled the "simple" implementation.)

@jusic

jusic commented Mar 10, 2017

We hit a similar locking problem on Ubuntu 16.04, numpy 1.12.0 with OpenBLAS libopenblasp-r0-39a31c03.2.18.so, when executing this Python line:

ttf = tf_trans.dot(tf_rot.dot(np.linalg.inv(tf_trans)))

The program got deadlocked in:

(gdb) bt
#0  0x00007fb70cdb2c47 in sched_yield () at ../sysdeps/unix/syscall-template.S:84
#1  0x00007fb7069e2e75 in exec_blas_async_wait () from /usr/local/lib/python2.7/dist-packages/numpy/core/../.libs/libopenblasp-r0-39a31c03.2.18.so
#2  0x00007fb7069e2f66 in exec_blas () from /usr/local/lib/python2.7/dist-packages/numpy/core/../.libs/libopenblasp-r0-39a31c03.2.18.so
#3  0x00007fb7069e147e in gemm_thread_n () from /usr/local/lib/python2.7/dist-packages/numpy/core/../.libs/libopenblasp-r0-39a31c03.2.18.so
#4  0x00007fb7069f7c6b in dgetrs_N_parallel () from /usr/local/lib/python2.7/dist-packages/numpy/core/../.libs/libopenblasp-r0-39a31c03.2.18.so
#5  0x00007fb7067d898e in dgesv_ () from /usr/local/lib/python2.7/dist-packages/numpy/core/../.libs/libopenblasp-r0-39a31c03.2.18.so
#6  0x00007fb6f41a5fb3 in call_dgesv (params=0x7ffd8fcdb580) at numpy/linalg/umath_linalg.c.src:1630
#7  DOUBLE_inv (args=0x7fb6836f7db0, dimensions=<optimized out>, steps=<optimized out>, __NPY_UNUSED_TAGGEDfunc=<optimized out>) at numpy/linalg/umath_linalg.c.src:1729
#8  0x00007fb6f493f6c2 in PyUFunc_GeneralizedFunction (ufunc=<optimized out>, args=args@entry=(<numpy.ndarray at remote 0x7fb6833f0ee0>,), 
    kwds=kwds@entry={'extobj': [8192, 1536, <function at remote 0x7fb6f45ef6e0>], 'signature': 'd->d'}, op=<optimized out>) at numpy/core/src/umath/ufunc_object.c:2417
#9  0x00007fb6f4940005 in PyUFunc_GenericFunction (ufunc=ufunc@entry=0x1413c60, args=args@entry=(<numpy.ndarray at remote 0x7fb6833f0ee0>,), 
    kwds=kwds@entry={'extobj': [8192, 1536, <function at remote 0x7fb6f45ef6e0>], 'signature': 'd->d'}, op=op@entry=0x7ffd8fcdd330) at numpy/core/src/umath/ufunc_object.c:2545
#10 0x00007fb6f49414f6 in ufunc_generic_call (ufunc=ufunc@entry=0x1413c60, args=(<numpy.ndarray at remote 0x7fb6833f0ee0>,), 
    kwds={'extobj': [8192, 1536, <function at remote 0x7fb6f45ef6e0>], 'signature': 'd->d'}) at numpy/core/src/umath/ufunc_object.c:4339
#11 0x00000000004b0cb3 in PyObject_Call () at ../Objects/abstract.c:2546
#12 0x00000000004c9faf in do_call (nk=<optimized out>, na=1, pp_stack=0x7ffd8fcdd860, func=<numpy.ufunc at remote 0x1413c60>) at ../Python/ceval.c:4567
#13 call_function (oparg=<optimized out>, pp_stack=0x7ffd8fcdd860) at ../Python/ceval.c:4372
#14 PyEval_EvalFrameEx () at ../Python/ceval.c:2987
#15 0x00000000004c2765 in PyEval_EvalCodeEx () at ../Python/ceval.c:3582
#16 0x00000000004ca8d1 in fast_function (nk=0, na=<optimized out>, n=<optimized out>, pp_stack=0x7ffd8fcdda70, func=<function at remote 0x7fb6f46071b8>)
    at ../Python/ceval.c:4445
#17 call_function (oparg=<optimized out>, pp_stack=0x7ffd8fcdda70) at ../Python/ceval.c:4370
#18 PyEval_EvalFrameEx () at ../Python/ceval.c:2987
#19 0x00000000004c9d8f in fast_function (nk=<optimized out>, na=<optimized out>, n=<optimized out>, pp_stack=0x7ffd8fcddbc0, func=<function at remote 0x7fb6833e9230>)
    at ../Python/ceval.c:4435
#20 call_function (oparg=<optimized out>, pp_stack=0x7ffd8fcddbc0) at ../Python/ceval.c:4370
#21 PyEval_EvalFrameEx () at ../Python/ceval.c:2987
#22 0x00000000004c2765 in PyEval_EvalCodeEx () at ../Python/ceval.c:3582
#23 0x00000000004c2509 in PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, locals=<optimized out>) at ../Python/ceval.c:669
#24 0x00000000004f1def in run_mod.lto_priv () at ../Python/pythonrun.c:1376
#25 0x00000000004ec652 in PyRun_FileExFlags () at ../Python/pythonrun.c:1362
#26 0x00000000004eae31 in PyRun_SimpleFileExFlags () at ../Python/pythonrun.c:948
#27 0x000000000049e14a in Py_Main () at ../Modules/main.c:640
#28 0x00007fb70cce9830 in __libc_start_main (main=0x49dab0 <main>, argc=4, argv=0x7ffd8fcde008, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, 
    stack_end=0x7ffd8fcddff8) at ../csu/libc-start.c:291
#29 0x000000000049d9d9 in _start ()
(gdb) 

The deadlock (to my knowledge) does not occur deterministically.

@martin-frbg
Collaborator

@jusic can you try building your own OpenBLAS from a checkout of the current "develop" branch here?
I committed some changes around the beginning of this year that should at least reduce the likelihood of such conflicts, and from the version number you quote, what you are using could be significantly older.
(I could not find a 39a31c03 hash in the archive, if that is not just a cut-and-paste artefact in the middle of the r0.2.18 version name you quoted; 0.2.18 as such was released almost a year ago.)

@jusic

jusic commented Mar 14, 2017

Thanks for the response. We'll see if we can do that. The numpy we were using is the default pre-packaged binary Python wheel, and it contains that OpenBLAS version.

@jusic

jusic commented Mar 14, 2017

@martin-frbg I swapped the precompiled OpenBLAS binary in said numpy for one compiled from the develop branch (libopenblas_haswellp-r0.2.20.dev.so, git hash 12e476f), and we haven't seen a lockup yet in our program. The hang used to appear within an hour or two; now I've been running for something like four hours without a deadlock. Not a very conclusive test, but at least indicative that the fix might be working.

@jusic

jusic commented Mar 15, 2017

Now it has been running without a deadlock for 18 hours+. Seems good.

@jusic

jusic commented Apr 4, 2017

@martin-frbg We've had much more mileage with the OpenBLAS library compiled from the develop branch (12e476f), and so far haven't seen any lockups in our program. Thanks!
