Fix device assignment in get_device_name for distributed training
#3303
This PR updates the `sentence_transformers.util.get_device_name` utility to better support multi-GPU setups using tools like `accelerate` and `torchrun`.

### Context
This issue in Pylate shows a small problem when a device is not explicitly provided for a `SentenceTransformer` model and a training run is launched with `accelerate` or `torchrun`: multiple unexpected processes with low VRAM usage remain on the same GPU.

It seems to happen because in the `SentenceTransformer` constructor, the `get_device_name` function sets `cuda` as the device for every rank by default, which causes multiple processes to remain on `cuda:0` even after `accelerate` distributes the model across all GPUs. Even if the script runs fine and performance doesn't seem to be impacted, it still ties up VRAM on GPU 0 that could be better utilized.
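A minimal illustration of how the issue manifests (a hypothetical snippet, not taken from the linked script; the model name is only an example):

```python
# Each process spawned by `accelerate launch` or `torchrun` runs this same code.
# With device=None, get_device_name() currently returns "cuda", which PyTorch
# resolves to cuda:0, so every rank initially places the model on GPU 0.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # no explicit device
print(model.device)  # reports cuda:0 on every rank before the trainer redistributes it
```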
This can be reproduced even when using a pure `sentence_transformers` script like this one: training_gooaq_lora.py. The same behavior happens when launching with `accelerate launch` or `torchrun` (even when the code is properly wrapped in a `main()` block).

### Proposed Fix
We can update the `get_device_name()` function to:

- Use `torch.distributed.get_rank()` when distributed training is initialized.
- Otherwise, read `LOCAL_RANK` from the environment and resolve to `cuda:{LOCAL_RANK}`.
- Otherwise, fall back to `"cpu"`, `"mps"`, `"npu"`, or `"hpu"` as before.

This ensures that by default, the correct GPU device is used per process, even when a model is set with `device=None`.
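A minimal sketch of what the updated logic could look like, assuming the current structure of `sentence_transformers.util.get_device_name` (the actual diff may differ, and the `npu`/`hpu` checks are omitted here for brevity):

```python
import os

import torch


def get_device_name() -> str:
    """Return the device name for the current process (simplified sketch)."""
    if torch.cuda.is_available():
        # If a distributed process group is already initialized, use its rank.
        if torch.distributed.is_available() and torch.distributed.is_initialized():
            return f"cuda:{torch.distributed.get_rank()}"
        # Otherwise, honor the LOCAL_RANK set by launchers such as accelerate/torchrun.
        local_rank = os.environ.get("LOCAL_RANK")
        if local_rank is not None and local_rank.isdigit():
            return f"cuda:{local_rank}"
        # Plain `python script.py`: no rank information, keep the previous behavior.
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    # ... "npu" / "hpu" checks as before ...
    return "cpu"
```

With `accelerate launch` or `torchrun`, each process then resolves its own `cuda:{rank}` device instead of all of them landing on `cuda:0`; with plain `python script.py`, the result is still `cuda` (i.e. `cuda:0`).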
Note: This shouldn't change the behavior when launching as usual with `python script.py`, since if no local rank is found, it will default to `cuda:0`.

cc @NohTow