Open
Description
The default timeout for c10d rdzv is 60 seconds. The more nodes a job uses, the less likely it is that all of them join the rendezvous within that window. Luckily, torchrun provides a way to control this: https://github.com/pytorch/pytorch/blob/main/torch/distributed/run.py#L443
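For illustration, here is a minimal sketch of how a longer timeout might be passed to torchrun. It assumes the option behind the link above is `--rdzv_conf`, which takes comma-separated `key=value` pairs, and that `join_timeout` and `read_timeout` are the relevant keys for the c10d backend; both should be checked against the installed PyTorch version. The helper name, script name, endpoint, and values are made up for the example.

```python
import shlex


def build_torchrun_cmd(script, nnodes, nproc_per_node, rdzv_endpoint, rdzv_conf):
    """Build a torchrun argv with rendezvous settings (hypothetical helper)."""
    # --rdzv_conf expects "key1=value1,key2=value2".
    # The key names passed in are assumptions, not verified here.
    conf = ",".join(f"{key}={value}" for key, value in rdzv_conf.items())
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc_per_node={nproc_per_node}",
        "--rdzv_backend=c10d",
        f"--rdzv_endpoint={rdzv_endpoint}",
        f"--rdzv_conf={conf}",
        script,
    ]


cmd = build_torchrun_cmd(
    "train.py",                 # illustrative script name
    64,                         # illustrative node count
    8,
    "node0.example.com:29500",  # illustrative endpoint
    {"join_timeout": 1800, "read_timeout": 600},
)
print(shlex.join(cmd))
```

The same command has to be issued on every node, so the timeout only helps if all launchers pass the same value.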
Motivation/Background
Unfortunately, there's no environment variable that could be used instead.
Detailed Proposal
Simply add a new option next to rdzv_backend and rdzv_port.
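A rough sketch of what the proposed option could look like. The function below is a hypothetical stand-in for the component, not its actual code: only `rdzv_backend` and `rdzv_port` come from the proposal, while `rdzv_conf`, the function name, the other parameters, and all default values are assumptions.

```python
from typing import List


def launch_args(
    script: str,
    host: str = "localhost",      # assumed default
    rdzv_backend: str = "c10d",   # assumed default
    rdzv_port: int = 29500,       # assumed default
    rdzv_conf: str = "",          # proposed new option, e.g. "join_timeout=1800"
) -> List[str]:
    """Assemble torchrun arguments, forwarding rdzv_conf only when it is set."""
    args = [
        "torchrun",
        f"--rdzv_backend={rdzv_backend}",
        f"--rdzv_endpoint={host}:{rdzv_port}",
    ]
    if rdzv_conf:
        # Passed through unchanged; torchrun parses the key=value pairs.
        args.append(f"--rdzv_conf={rdzv_conf}")
    args.append(script)
    return args


print(launch_args("train.py", rdzv_conf="join_timeout=1800"))
```

Leaving the new option empty by default keeps existing invocations unchanged.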
Alternatives
Create and maintain a custom component.
Additional context/links