PreTrain-T5-[get_samples_mapping]-[ RuntimeError: Socket Timeout ] #109

Closed
@Hanlard

In the "get_samples_mapping" function in megatron\data\dataset_utils.py, the following line ran for 12 hours:
samples_mapping = helpers.build_mapping(..)

This line runs only on rank=0, so the other ranks presumably sit waiting at the next line, torch.distributed.all_reduce(counts, ... ), until it reports [ RuntimeError: Socket Timeout ].
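
To illustrate the failure mode described above, here is a minimal sketch that uses only the Python standard library. It is an analogy, not Megatron or PyTorch code: `run_ranks`, `build_seconds`, and `collective_timeout` are hypothetical names, `time.sleep` stands in for `helpers.build_mapping(..)`, and `threading.Barrier` stands in for the `all_reduce` that every rank must reach.

```python
import threading
import time


def run_ranks(world_size, build_seconds, collective_timeout):
    """Simulate ranks meeting at a collective after rank 0 does slow work.

    Returns a dict mapping rank -> "ok" or "timeout".
    """
    barrier = threading.Barrier(world_size)
    results = {}

    def worker(rank):
        if rank == 0:
            # Stand-in for the slow helpers.build_mapping(..) call (assumption).
            time.sleep(build_seconds)
        try:
            # Stand-in for all_reduce: every rank must arrive before the timeout.
            barrier.wait(timeout=collective_timeout)
            results[rank] = "ok"
        except threading.BrokenBarrierError:
            # One waiter timing out breaks the barrier for all ranks.
            results[rank] = "timeout"

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    # Rank 0 takes longer than the collective timeout: every rank fails.
    print(run_ranks(4, build_seconds=0.5, collective_timeout=0.1))
    # Timeout longer than the build: every rank succeeds.
    print(run_ranks(4, build_seconds=0.05, collective_timeout=2.0))
```

In the sketch, the run only succeeds when the timeout exceeds the time rank 0 spends building. By the same reasoning, one possible mitigation (not confirmed in this issue) is to pass a larger `timeout` to `torch.distributed.init_process_group`, or to build the mapping ahead of time so the training job finds it already cached.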


Labels: stale (No activity in 60 days on issue or PR)
