Fix device assignment in get_device_name for distributed training
#3303
This PR updates the `sentence_transformers.util.get_device_name` utility to better support multi-GPU setups using tools like `accelerate` and `torchrun`.

### Context
This issue in Pylate shows a small problem when a device is not explicitly provided for a `SentenceTransformer` model and a training run is launched with `accelerate` or `torchrun`: multiple unexpected processes with low VRAM usage remain on the same GPU.

It seems to happen because in the `SentenceTransformer` constructor, the `get_device_name` function sets `cuda` as the device for every rank by default, which causes multiple processes to remain on `cuda:0` even after `accelerate` distributes the model across all GPUs. Even if the script runs fine and performance doesn't seem to be impacted, it still ties up VRAM on GPU 0 that could be better utilized.
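A minimal illustration of how the issue manifests (a hypothetical snippet, not taken from the linked script; the model name is only an example):

```python
# Each process spawned by `accelerate launch` or `torchrun` runs this same code.
# With device=None, get_device_name() currently returns "cuda", which PyTorch
# resolves to cuda:0, so every rank initially places the model on GPU 0.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # no explicit device
print(model.device)  # reports cuda:0 on every rank before the trainer redistributes it
```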
This can be reproduced even when using a pure `sentence_transformers` script like this one: training_gooaq_lora.py. The same behavior happens when launching with `accelerate launch` or `torchrun` (even when the code is properly wrapped in a `main()` block).

### Proposed Fix
We can update the `get_device_name()` function to:

- Use `torch.distributed.get_rank()` when distributed training is initialized.
- Otherwise, read `LOCAL_RANK` from the environment and resolve to `cuda:{LOCAL_RANK}`.
- Otherwise, fall back to `"cpu"`, `"mps"`, `"npu"`, or `"hpu"` as before.

This ensures that by default, the correct GPU device is used per process, even when a model is set with `device=None`.
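A minimal sketch of what the updated logic could look like, assuming the current structure of `sentence_transformers.util.get_device_name` (the actual diff may differ, and the `npu`/`hpu` checks are omitted here for brevity):

```python
import os

import torch


def get_device_name() -> str:
    """Return the device name for the current process (simplified sketch)."""
    if torch.cuda.is_available():
        # If a distributed process group is already initialized, use its rank.
        if torch.distributed.is_available() and torch.distributed.is_initialized():
            return f"cuda:{torch.distributed.get_rank()}"
        # Otherwise, honor the LOCAL_RANK set by launchers such as accelerate/torchrun.
        local_rank = os.environ.get("LOCAL_RANK")
        if local_rank is not None and local_rank.isdigit():
            return f"cuda:{local_rank}"
        # Plain `python script.py`: no rank information, keep the previous behavior.
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    # ... "npu" / "hpu" checks as before ...
    return "cpu"
```

With `accelerate launch` or `torchrun`, each process then resolves its own `cuda:{rank}` device instead of all of them landing on `cuda:0`; with plain `python script.py`, the result is still `cuda` (i.e. `cuda:0`).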
Note: This shouldn't change the behavior when launching as usual with `python script.py`, since if no local rank is found, it will default to `cuda:0`.

cc @NohTow