Terrible Deadlock issue #937
Which version of OpenBLAS and what operating system? Did you build with OpenMP support?
@martin-frbg Thank you for the reply. I have seen the FAQ; does that mean that if I have a multi-threaded application, I can only use OMP_NUM_THREADS=1?
From my limited understanding (note I am a user, not a developer), to avoid deadlocks you would need to build with OpenMP if both your application and the OpenBLAS it calls will be creating threads (I am not that familiar with numpy, but doesn't it run multithreaded by default?)
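For reference, restricting OpenBLAS to a single thread (as the FAQ suggests for multi-threaded callers) can be done with environment variables that the library reads once at load time. This is a minimal sketch assuming a numpy build linked against OpenBLAS; the variables themselves are the standard ones, but whether they take effect depends on how the library was built:

```python
import os

# These must be set before numpy (and thus OpenBLAS) is first imported,
# because OpenBLAS reads them once when the shared library is loaded.
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["OMP_NUM_THREADS"] = "1"  # relevant for OpenMP-enabled builds

import numpy as np

a = np.random.rand(100, 100)
b = a @ a  # the BLAS call now runs single-threaded inside OpenBLAS
```

Setting the variables in the shell before launching Python (`OPENBLAS_NUM_THREADS=1 python app.py`) achieves the same thing without relying on import order.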
sched_yield is called by the multithreaded OpenBLAS only. Please build according to your needs - a single-threaded version - and with ldb==1 you should call gemv, not gemm
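The gemv-versus-gemm point above can be sketched in numpy terms (the shapes here are illustrative, not from the original report): when the right-hand side is a single column, a matrix-vector product is the natural operation, and numpy is expected to route a 2-D-times-1-D product to the BLAS *gemv kernel rather than *gemm:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
b = rng.standard_normal(3)

# Matrix-vector product: numpy dispatches a 2-D @ 1-D product to *gemv
y_gemv = A @ b

# The same numbers computed as a degenerate (n x 1) matrix-matrix
# product, which goes through the *gemm path instead
y_gemm = (A @ b.reshape(3, 1)).ravel()

assert np.allclose(y_gemv, y_gemm)
```

Numerically the two are equivalent; the difference is which threaded BLAS kernel ends up running.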
We just built numpy with OpenBLAS, and gemm is called by numpy... Is that wrong? And we do want to use the multithreaded OpenBLAS to speed up our application.
Update to at least 0.2.16 to get various speed improvements as well as a fix for a thread-safety issue seen in conjunction with numpy - cf. #716 (maybe whatever caused the crashes there was the cause of other undesirable behaviour as well). Did you actually see a speedup with multithreaded OpenBLAS (in the cases where it did not hang), or are you just expecting it to perform better than a single-threaded version of the library?
I really do see a speedup with the multithreaded version; it works very well when it does not hang. Maybe I should try the latest version as you suggest.
Will using the latest version fix the hang? I ran into this problem as well.
No, we finally set a timeout on our jobs and restart them to avoid it
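The timeout-and-restart workaround described above can be sketched with the standard multiprocessing module. Here `run_job` is a hypothetical stand-in for the real BLAS-heavy workload; the point is only the supervision pattern (run in a child process, kill and restart if it exceeds the timeout):

```python
import multiprocessing as mp

def run_job(q):
    # Hypothetical placeholder for the real BLAS-heavy job
    q.put(sum(i * i for i in range(1000)))

def run_with_timeout(timeout_s=5.0):
    """Run the job in a child process; return its result, or None on hang."""
    q = mp.Queue()
    p = mp.Process(target=run_job, args=(q,))
    p.start()
    p.join(timeout_s)
    if p.is_alive():
        # Job is presumed hung (e.g. deadlocked in the BLAS): kill it so
        # the caller can schedule a restart.
        p.terminate()
        p.join()
        return None
    return q.get()
```

A supervisor loop would call `run_with_timeout()` and resubmit the job whenever it returns None.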
Following #1071, you could try whether building OpenBLAS with USE_SIMPLE_THREADED_LEVEL3=1 helps
What does USE_SIMPLE_THREADED_LEVEL3=1 mean? Will it weaken the performance?
First of all, it will solve the crashing
I cannot promise that it will solve the problem, it is just another option to try and work around what appears to be thread contention issues - both implementations of the BLAS threading date back to Kazushige Goto's libGoto2 from about 2009. (I checked the sparse documentation that came with libGoto2-1.08 but found no explanation of either method. libGoto-1.0 from 2006 only had what is now labeled the "simple" implementation)
We hit a similar locking problem on Ubuntu 16.04, numpy 1.12.0 with OpenBLAS libopenblasp-r0-39a31c03.2.18.so. When computing a Python line:
The program got deadlocked in:
The deadlock (to my knowledge) does not occur deterministically.
@jusic can you try building your own OpenBLAS from a checkout of the current "develop" branch here?
Thanks for the response. We'll see if we can do that. The numpy we were using is the default pre-packaged binary Python wheel, and it contains that openblas version.
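As an aside, numpy can report which BLAS/LAPACK libraries a given build was linked against, which is a quick way to confirm whether the wheel's bundled OpenBLAS (or a swapped-in replacement) is actually in use. The exact output format varies between numpy versions:

```python
import numpy as np

# Prints the BLAS/LAPACK build configuration of this numpy binary,
# including library names and paths where available.
np.show_config()
```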
@martin-frbg I swapped the precompiled OpenBLAS binary in the said numpy to a compiled one from develop branch (libopenblas_haswellp-r0.2.20.dev.so, 12e476f git hash) and we haven't seen a lockup yet in our program. It took an hour or two before and now I've been running something like four hours without a deadlock. Not a very conclusive test, but at least indicative that the fix might be working.
Now it has been running without a deadlock for 18 hours+. Seems good. |
@martin-frbg We've had much more mileage with the OpenBLAS library compiled from the develop branch (12e476f), and so far haven't seen any lockups in our program. Thanks! |
Hello, we found a terrible deadlock in OpenBLAS; trace info:
```
81  in ../sysdeps/unix/syscall-template.S
(gdb) bt
#0  0x00007f79e0c5e2a7 in sched_yield () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f79de5aa735 in inner_thread () from /opt/OpenBLAS/lib/libopenblas.so.0
#2  0x00007f79de6b5369 in exec_blas () from /opt/OpenBLAS/lib/libopenblas.so.0
#3  0x00007f79de5aadde in gemm_driver.constprop.0 () from /opt/OpenBLAS/lib/libopenblas.so.0
#4  0x00007f79de5aae55 in dgemm_thread_tt () from /opt/OpenBLAS/lib/libopenblas.so.0
#5  0x00007f79de4c761e in cblas_dgemm () from /opt/OpenBLAS/lib/libopenblas.so.0
#6  0x00007f79df40bbf0 in gemm (typenum=typenum@entry=12, transA=<optimized out>, transB=<optimized out>, m=<optimized out>, n=<optimized out>, k=<optimized out>, A=A@entry=0x7f790dc0e850,
#7  0x00007f79df40c506 in cblas_matrixproduct (typenum=typenum@entry=12, ap1=0x7f790dc0e850, ap1@entry=0x7f790dc0ecb0, ap2=0x7f790dc0e940, ap2@entry=0x7f790dc0e2b0, out=out@entry=0x7f78f5818530)
#8  0x00007f79df3e16b4 in PyArray_MatrixProduct2 (op1=<optimized out>, op2=<numpy.ndarray at remote 0x7f790dc0e2b0>, out=0x7f78f5818530) at numpy/core/src/multiarray/multiarraymodule.c:989
#9  0x00007f79df3e479c in array_matrixproduct (__NPY_UNUSED_TAGGEDdummy=<optimized out>,
```
We just ran the job and found it blocked in sched_yield for 80 hours... = =
We cannot reproduce the problem every time, but it always happens once we run enough jobs.
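A minimal sketch of the kind of load involved: several application threads issuing dgemm calls into OpenBLAS concurrently. This is only an illustration of the usage pattern under which the intermittent hang was reported, not a guaranteed reproducer; the sizes and thread count here are arbitrary:

```python
import threading
import numpy as np

def hammer(n_iters=50, size=64):
    """One application thread making repeated matrix products (dgemm)."""
    a = np.random.rand(size, size)
    for _ in range(n_iters):
        a @ a  # each product is dispatched to the BLAS dgemm

# Multiple application threads calling into the (possibly multithreaded)
# BLAS at the same time -- the contention-prone pattern from the report.
threads = [threading.Thread(target=hammer) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```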
Our cpu is Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz
Can anyone help?