I tried following the instructions at https://docs.vllm.ai/en/latest/serving/distributed_serving.html for single-node multi-GPU inference, for when the model is too large to fit on a single GPU. I have a 3090 and a 3080, but I'm getting an OOM error with this command:
```bash
CUDA_VISIBLE_DEVICES=0,1 CUDA_DEVICE_ORDER=PCI_BUS_ID VLLM_ATTENTION_BACKEND=FLASHATTN vllm serve --max-model-len 5000 --max-num-seqs 1 --enable-chunked-prefill --max-num-batched-tokens 512 --gpu-memory-utilization 0.90 --enforce-eager --dtype half --tensor-parallel-size 1 --pipeline-parallel-size 2 Qwen/QwQ-32B-AWQ
```

It fails with:

```
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 68.00 MiB. GPU 0 has a total capacity of 9.65 GiB of which 30.62 MiB is free. Including non-PyTorch memory, this process has 9.36 GiB memory in use. Of the allocated memory 9.01 GiB is allocated by PyTorch, and 19.95 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```
I've also tried setting just --tensor-parallel-size 2 without --pipeline-parallel-size (roughly as sketched below).
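For completeness, that tensor-parallel attempt was the same command with the parallelism flags swapped (reconstructed from the command above; flags otherwise identical):

```bash
# Same flags as the failing command, but splitting the model across both GPUs with tensor parallelism
CUDA_VISIBLE_DEVICES=0,1 CUDA_DEVICE_ORDER=PCI_BUS_ID VLLM_ATTENTION_BACKEND=FLASHATTN \
vllm serve --max-model-len 5000 --max-num-seqs 1 --enable-chunked-prefill \
  --max-num-batched-tokens 512 --gpu-memory-utilization 0.90 --enforce-eager \
  --dtype half --tensor-parallel-size 2 Qwen/QwQ-32B-AWQ
```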
The same command works on the 3090 alone (--tensor-parallel-size 1 and CUDA_VISIBLE_DEVICES=1, without --pipeline-parallel-size), even with --max-model-len 15000; a reconstruction of that working command is below for comparison. My goal is just to extend the context length further by adding the 3080. Is this possible?
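For comparison, this is roughly the single-GPU invocation that works (same flags, only the 3090 visible, and the longer context):

```bash
# Works on the 3090 by itself, even with the larger context window
CUDA_VISIBLE_DEVICES=1 CUDA_DEVICE_ORDER=PCI_BUS_ID VLLM_ATTENTION_BACKEND=FLASHATTN \
vllm serve --max-model-len 15000 --max-num-seqs 1 --enable-chunked-prefill \
  --max-num-batched-tokens 512 --gpu-memory-utilization 0.90 --enforce-eager \
  --dtype half --tensor-parallel-size 1 Qwen/QwQ-32B-AWQ
```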