
vulkan: lock accesses of pinned_memory vector #14333


Merged: 1 commit, Jun 28, 2025

Conversation

jeffbolznv
Collaborator

I have not been able to reproduce a crash in test-thread-safety, but I did see WARNING: failed to free pinned memory: memory not in map, which points to a lack of locking around accesses to the pinned memory array.
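The fix amounts to guarding every read and write of the pinned-memory list with a single mutex. A minimal sketch of the pattern follows; the names pinned_memory, register_pinned, and free_pinned are hypothetical stand-ins, not the backend's actual identifiers:

```cpp
#include <cstddef>
#include <cstdio>
#include <mutex>
#include <tuple>
#include <vector>

// Hypothetical stand-in for the backend's pinned-memory bookkeeping:
// each entry records (host pointer, size, device buffer handle).
static std::vector<std::tuple<void*, size_t, int>> pinned_memory;
static std::mutex pinned_memory_mutex; // guards every access below

static void register_pinned(void* ptr, size_t size, int buf) {
    std::lock_guard<std::mutex> guard(pinned_memory_mutex);
    pinned_memory.emplace_back(ptr, size, buf);
}

static bool free_pinned(void* ptr) {
    std::lock_guard<std::mutex> guard(pinned_memory_mutex);
    for (size_t i = 0; i < pinned_memory.size(); i++) {
        if (std::get<0>(pinned_memory[i]) == ptr) {
            pinned_memory.erase(pinned_memory.begin() + i);
            return true;
        }
    }
    // Without the lock, a concurrent insert/erase racing with this scan
    // could produce this warning spuriously.
    fprintf(stderr, "WARNING: failed to free pinned memory: memory not in map\n");
    return false;
}
```

With unsynchronized access, one thread can erase an entry while another is scanning the vector, which is exactly the kind of race that surfaces as the "memory not in map" warning.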

@jeffbolznv jeffbolznv requested a review from 0cc4m June 22, 2025 20:47
@github-actions github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels Jun 22, 2025
@bandoti
Collaborator

bandoti commented Jun 23, 2025

@jeffbolznv Not sure if this is related, but the ubuntu-22-cmake-vulkan job is seeing a CI test failure in test-thread-safety. This might provide a pathway to reproduce it—running on Ubuntu, that is.

I will try running the test locally out of curiosity and see what happens for me—though perhaps it won't be of any help in fixing the issue. 😅

@jeffbolznv
Collaborator Author

I haven't been able to reproduce the crash using lavapipe on WSL. If somebody is able to get a callstack where it crashes, that would be helpful.

@bandoti
Collaborator

bandoti commented Jun 23, 2025

Waiting to see if that Ubuntu 22 run fails again—and if so, I will create a temporary workflow that uploads all its assets and a crash dump.

@bandoti
Collaborator

bandoti commented Jun 23, 2025

It succeeded this time, so it looks a bit random. It would be great if the CI uploaded artifacts and a core dump when that happens. I can give that a go, but it'll probably take me a couple of days, working a little bit at a time.

@bandoti
Collaborator

bandoti commented Jun 23, 2025

I'm thinking valgrind might be useful to help us catch this one.

@bandoti
Collaborator

bandoti commented Jun 23, 2025

@jeffbolznv I managed to create a crash dump on the master branch. Because the file is so large, I just did a backtrace on all the threads to start. Here is the gist file with details.

https://gist.github.com/bandoti/b229ea2a8fced26f01b112834c826150

I used WSL2 and ran the following commands (mimicking the CI test):

git clone git@github.com:bandoti/llama.cpp.git
cd llama.cpp/
cmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Debug
cmake --build build --config Debug
export GGML_VK_VISIBLE_DEVICES=0
./build/bin/test-thread-safety -hf ggml-org/models -hff tinyllamas/stories15M-q4_0.gguf -ngl 99 -p "The meaning of life is" -n 128 -c 256 -ub 32 -np 4

The crash occurred the first time, but not the second time I ran the executable. If you would like more info from the crash dump, please let me know and I'll fetch it with gdb.

@jeffbolznv
Collaborator Author

Did it say which thread crashed? Thread 1 is hitting the issue this PR fixes, so maybe it's the same bug.

I tried running with valgrind. There are a ton of warnings coming from lavapipe's use of LLVM. I'm not sure whether those are legitimate, but it does make me think we ought to mutex ggml_vk_load_shaders, since it's setting all the pipeline pointers in the device. I'll do that in a separate change.
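Serializing a one-time initialization like shader loading could look roughly like this. This is a hedged sketch: vk_device_stub and load_shaders are made-up stand-ins for the real device struct and ggml_vk_load_shaders, not the actual code:

```cpp
#include <mutex>

// Hypothetical device stub: the real device holds many pipeline pointers
// that ggml_vk_load_shaders fills in; one placeholder stands in for them.
struct vk_device_stub {
    std::mutex shader_mutex;       // serializes pipeline creation
    bool shaders_loaded = false;   // set once loading has completed
    void* example_pipeline = nullptr;
};

static void load_shaders(vk_device_stub& dev) {
    std::lock_guard<std::mutex> guard(dev.shader_mutex);
    if (dev.shaders_loaded) {
        return; // another thread already built the pipelines
    }
    dev.example_pipeline = &dev; // placeholder for real pipeline creation
    dev.shaders_loaded = true;
}
```

Without the lock, two threads entering shader loading at the same time could both write the device's pipeline pointers, which is what the valgrind warnings hint at.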

@bandoti
Collaborator

bandoti commented Jun 23, 2025

Yeah, thread 1 is crashing. I should probably pull this branch instead—but I've only seen the crash on master so far.

I find that AI is really great when it comes to parsing gdb/valgrind output! 😊

Collaborator

@0cc4m 0cc4m left a comment


LGTM

I wasn't aware of recursive mutexes, that's a cool feature. Always good to learn new stuff.
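For reference, a std::recursive_mutex lets the thread that already holds the lock acquire it again, which avoids deadlock when a locked entry point calls a helper that also takes the lock. A tiny illustrative sketch, not the PR's actual code:

```cpp
#include <mutex>

static std::recursive_mutex m; // re-entrant for the owning thread
static int counter = 0;

static void helper() {
    // Locks again on the same thread; a plain std::mutex would deadlock here.
    std::lock_guard<std::recursive_mutex> guard(m);
    counter++;
}

static void entry_point() {
    std::lock_guard<std::recursive_mutex> guard(m);
    helper(); // safe: the recursive mutex tracks ownership and a lock count
}
```

The mutex is released only once every nested lock_guard has been destroyed, so the lock count returns to zero when entry_point returns.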

@0cc4m 0cc4m merged commit 00d5282 into ggml-org:master Jun 28, 2025
47 checks passed