Description
I'm using LLamaSharp 0.21.0 (CUDA 12.8 backend) built against llama.cpp commit:
5783575
Inference model (one instance for all users): Qwen2.5-14B-1M-Q5-K-M
Embedding model (one instance for all users): Qwen2.5-1.5B-Q5-K-M
Together the models use 12 GB of VRAM.
FlashAttention = true!
Server memory: 32-48 GB of RAM.
There is always enough memory (RAM and VRAM) for the requests, with a margin to spare.
Everything works fine with a single user.
When I run 3-4 web requests at the same time, the application crashes with fatal CUDA errors. The errors are almost always the same. Each request uses about 3.3 GB of VRAM (context size: 16K, n_batch: 2048, n_ubatch: 512); a rough native-API equivalent of these settings is sketched below.
I see the errors both with one GPU (RTX 4090) and with two GPUs (2 x RTX 4090, layer split mode, tensors and VRAM split 50/50).
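For reference, each request's context corresponds roughly to the following settings. This is a sketch in llama.cpp C API terms; I actually go through LLamaSharp, so the exact mapping onto these native calls is an assumption on my part:

```cpp
// Rough native-API equivalent of the per-request context settings above.
// The mapping from LLamaSharp parameters to these fields is my assumption.
#include "llama.h"

llama_context * make_request_context(llama_model * model) {
    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx      = 16 * 1024; // 16K context per request
    cparams.n_batch    = 2048;      // logical batch size (n_batch)
    cparams.n_ubatch   = 512;       // physical batch size (n_ubatch)
    cparams.flash_attn = true;      // FlashAttention enabled
    return llama_new_context_with_model(model, cparams);
}
```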
I've read the following, and apparently this shouldn't be happening:
#3960
#6017
The errors disappear when I add locking at the LLamaSharp level that serializes the native calls: creating a context, freeing a context, decoding, embedding, clearing the KV cache, etc.
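In native terms, the workaround amounts to putting one global lock around every call that touches a context, roughly like this (a sketch of the idea, not my actual C# code):

```cpp
// Sketch of the workaround: a single global mutex serializing every native
// call that touches a context (decode, embed, KV cache clear, create/free).
#include <mutex>
#include "llama.h"

static std::mutex g_llama_mutex;

int guarded_decode(llama_context * ctx, llama_batch batch) {
    std::lock_guard<std::mutex> lock(g_llama_mutex);
    return llama_decode(ctx, batch);
}

void guarded_kv_clear(llama_context * ctx) {
    std::lock_guard<std::mutex> lock(g_llama_mutex);
    llama_kv_cache_clear(ctx);
}
```

With this serialization in place the crashes stop, but requests are effectively processed one at a time.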
The LLamaSharp team is aware of the problem, but their code looks clean and simply calls the native llama.cpp API.
The LLamaSharp code already has its own partial (but not complete) resource locking for multithreaded use.
But wasn't #3960 supposed to solve all of these problems?
Can you tell me what to do and where the problem might be?
What is the correct way to use llama.cpp with a GPU in a multithreaded environment?
Are there recommended build options for llama.cpp for multithreaded use?
If I understand correctly, does each thread need its own ggml_backend instance?
Is it possible to create a ggml_backend instance without reloading the model?
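To clarify what I mean by not reloading the model: the weights are loaded once, and every request gets its own context over the same llama_model, roughly like this (a sketch; the model path is a placeholder):

```cpp
// One shared model, separate contexts created on demand -- no second model
// load. This is the pattern I assume should be safe for concurrent requests.
#include "llama.h"

void shared_model_example() {
    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99; // fully offloaded to the GPU(s)

    // placeholder path
    llama_model * model = llama_load_model_from_file("qwen2.5-14b-1m-q5_k_m.gguf", mparams);

    llama_context_params cparams = llama_context_default_params();
    llama_context * ctx_a = llama_new_context_with_model(model, cparams); // request A
    llama_context * ctx_b = llama_new_context_with_model(model, cparams); // request B

    llama_free(ctx_a);
    llama_free(ctx_b);
    llama_free_model(model);
}
```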
Thanks.
CUDA Errors:
2025-02-09 16:44:06.2064 LLama.Native.SafeLLamaContextHandle.llama_decode Error: CUDA error: operation failed due to a previous error during capture
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-09 16:44:06.2064 LLama.Native.NativeApi.llama_kv_cache_clear Error: CUDA error: operation not permitted when stream is capturing
SafeLLamaContextHandle.KvCacheClear => NativeApi.llama_kv_cache_clear => NativeApi.llama_kv_cache_clear
2025-02-09 16:44:06.2427 LLama.Native.SafeLLamaContextHandle.llama_decode Error: current device: 1, in function ggml_cuda_op_mul_mat at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda\ggml-cuda.cu:1615
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-09 16:44:06.2427 LLama.Native.NativeApi.llama_kv_cache_clear Error: current device: 1, in function ggml_backend_cuda_buffer_clear at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda\ggml-cuda.cu:605
SafeLLamaContextHandle.KvCacheClear => NativeApi.llama_kv_cache_clear => NativeApi.llama_kv_cache_clear
2025-02-09 16:44:06.2427 LLama.Native.NativeApi.llama_kv_cache_clear Error: cudaDeviceSynchronize()
SafeLLamaContextHandle.KvCacheClear => NativeApi.llama_kv_cache_clear => NativeApi.llama_kv_cache_clear
2025-02-09 16:44:06.2427 LLama.Native.SafeLLamaContextHandle.llama_decode Error: cudaGetLastError()
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-09 16:48:54.9660 LLama.Native.SafeLLamaContextHandle.llama_decode Error: ggml_cuda_compute_forward: ADD failed
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-09 16:48:54.9660 LLama.Native.NativeApi.llama_kv_cache_clear Error: CUDA error: operation not permitted when stream is capturing
SafeLLamaContextHandle.KvCacheClear => NativeApi.llama_kv_cache_clear => NativeApi.llama_kv_cache_clear
2025-02-09 16:48:54.9864 LLama.Native.SafeLLamaContextHandle.llama_decode Error: CUDA error: operation failed due to a previous error during capture
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-09 16:48:54.9864 LLama.Native.NativeApi.llama_kv_cache_clear Error: current device: 1, in function ggml_backend_cuda_buffer_clear at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda\ggml-cuda.cu:607
SafeLLamaContextHandle.KvCacheClear => NativeApi.llama_kv_cache_clear => NativeApi.llama_kv_cache_clear
2025-02-09 16:48:54.9864 LLama.Native.SafeLLamaContextHandle.llama_decode Error: current device: 1, in function ggml_cuda_compute_forward at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda\ggml-cuda.cu:2313
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
2025-02-09 16:48:54.9864 LLama.Native.NativeApi.llama_kv_cache_clear Error: cudaDeviceSynchronize()
SafeLLamaContextHandle.KvCacheClear => NativeApi.llama_kv_cache_clear => NativeApi.llama_kv_cache_clear
2025-02-09 16:48:54.9864 LLama.Native.SafeLLamaContextHandle.llama_decode Error: err
SafeLLamaContextHandle.Decode => SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
Operating systems
Windows
Which llama.cpp modules do you know to be affected?
libllama (core library)
Problem description & steps to reproduce
- CUDA 12, 1-2 GPUs
- Multiple user requests in parallel (a minimal native-level sketch of the failing pattern is below)
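A minimal native-level sketch of what the server effectively does under load, assuming LLamaSharp maps more or less 1:1 onto these calls (placeholder path and placeholder tokens, not a real prompt):

```cpp
// Minimal sketch of the failing pattern: several threads, each with its own
// context over one shared model, calling llama_decode concurrently.
#include <thread>
#include <vector>
#include "llama.h"

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99;
    llama_model * model = llama_load_model_from_file("model.gguf", mparams); // placeholder path

    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i) { // 3-4 concurrent web requests
        workers.emplace_back([model]() {
            llama_context_params cparams = llama_context_default_params();
            cparams.n_ctx      = 16 * 1024;
            cparams.n_batch    = 2048;
            cparams.n_ubatch   = 512;
            cparams.flash_attn = true;

            llama_context * ctx = llama_new_context_with_model(model, cparams);

            llama_token tokens[8] = {1, 2, 3, 4, 5, 6, 7, 8}; // placeholder tokens
            llama_decode(ctx, llama_batch_get_one(tokens, 8)); // fatal CUDA errors appear around here
            llama_kv_cache_clear(ctx);

            llama_free(ctx);
        });
    }
    for (auto & t : workers) {
        t.join();
    }

    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```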