I tried following the instructions at https://docs.vllm.ai/en/latest/serving/distributed_serving.html for single-node multi-GPU inference, for when the model is too large to fit on a single GPU. I have a 3090 and a 3080, but I'm getting an OOM error with this command:
```bash
CUDA_VISIBLE_DEVICES=0,1 CUDA_DEVICE_ORDER=PCI_BUS_ID VLLM_ATTENTION_BACKEND=FLASHATTN vllm serve --max-model-len 5000 --max-num-seqs 1 --enable-chunked-prefill --max-num-batched-tokens 512 --gpu-memory-utilization 0.90 --enforce-eager --dtype half --tensor-parallel-size 1 --pipeline-parallel-size 2 Qwen/QwQ-32B-AWQ
```

It fails with:

```
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 68.00 MiB. GPU 0 has a total capacity of 9.65 GiB of which 30.62 MiB is free. Including non-PyTorch memory, this process has 9.36 GiB memory in use. Of the allocated memory 9.01 GiB is allocated by PyTorch, and 19.95 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```
I've also tried setting just --tensor-parallel-size 2 without --pipeline-parallel-size (roughly as sketched below).
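For completeness, that tensor-parallel attempt was the same command with the parallelism flags swapped (reconstructed from the command above; flags otherwise identical):

```bash
# Same flags as the failing command, but splitting the model across both GPUs with tensor parallelism
CUDA_VISIBLE_DEVICES=0,1 CUDA_DEVICE_ORDER=PCI_BUS_ID VLLM_ATTENTION_BACKEND=FLASHATTN \
vllm serve --max-model-len 5000 --max-num-seqs 1 --enable-chunked-prefill \
  --max-num-batched-tokens 512 --gpu-memory-utilization 0.90 --enforce-eager \
  --dtype half --tensor-parallel-size 2 Qwen/QwQ-32B-AWQ
```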
The same command works on the 3090 alone (--tensor-parallel-size 1 and CUDA_VISIBLE_DEVICES=1, without --pipeline-parallel-size), even with --max-model-len 15000; a reconstruction of that working command is below for comparison. My goal is just to extend the context length further by adding the 3080. Is this possible?
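For comparison, this is roughly the single-GPU invocation that works (same flags, only the 3090 visible, and the longer context):

```bash
# Works on the 3090 by itself, even with the larger context window
CUDA_VISIBLE_DEVICES=1 CUDA_DEVICE_ORDER=PCI_BUS_ID VLLM_ATTENTION_BACKEND=FLASHATTN \
vllm serve --max-model-len 15000 --max-num-seqs 1 --enable-chunked-prefill \
  --max-num-batched-tokens 512 --gpu-memory-utilization 0.90 --enforce-eager \
  --dtype half --tensor-parallel-size 1 Qwen/QwQ-32B-AWQ
```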