Allow passing rdzv_conf in dist component #1071

Open
@clumsy

Description

The default timeout for c10d rdzv is 60 seconds. The more nodes a job runs on, the less likely it is that all of them join within that window. Luckily, torchrun provides a way to control this through rdzv_conf: https://github.com/pytorch/pytorch/blob/main/torch/distributed/run.py#L443

Motivation/Background

Unfortunately, there's no environment variable that could be used instead.

Detailed Proposal

Simply add a new rdzv_conf option next to rdzv_backend and rdzv_port.
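A rough sketch of what this could look like, not the actual component code: `build_torchrun_args`, its parameters, and the `rdzv_host` placeholder are hypothetical names used only for illustration. The idea is just that the component accepts an optional `rdzv_conf` string and forwards it to torchrun's `--rdzv_conf` flag unchanged.

```python
from typing import List, Optional


def build_torchrun_args(
    script: str,
    nnodes: int,
    nproc_per_node: int,
    rdzv_backend: str = "c10d",
    rdzv_port: int = 29500,
    rdzv_conf: Optional[str] = None,  # proposed new option
    rdzv_host: str = "localhost",  # hypothetical placeholder for the rdzv host
) -> List[str]:
    """Build a torchrun command line (illustrative helper, not the real component)."""
    args = [
        "torchrun",
        "--rdzv_backend", rdzv_backend,
        "--rdzv_endpoint", f"{rdzv_host}:{rdzv_port}",
        "--nnodes", str(nnodes),
        "--nproc_per_node", str(nproc_per_node),
    ]
    # Forward the option only when the user sets it, so the default
    # behaviour stays exactly as it is today.
    if rdzv_conf:
        args += ["--rdzv_conf", rdzv_conf]
    args.append(script)
    return args


# Example: raise the timeout for a 16-node job. The key "join_timeout" is an
# assumption about what the c10d rendezvous handler reads from rdzv_conf.
cmd = build_torchrun_args(
    "train.py", nnodes=16, nproc_per_node=8, rdzv_conf="join_timeout=600"
)
print(" ".join(cmd))
```

Passing the value through as an opaque `key=value,...` string keeps the component independent of which keys a given rendezvous backend understands.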

Alternatives

Create and maintain a custom component.
