Closed
Description
In the "get_samples_mapping" function in megatron\data\dataset_utils.py, the following line ran for 12 hours:
samples_mapping = helpers.build_mapping(..)
This line runs only on rank 0, so the other ranks sit waiting at the next line, torch.distributed.all_reduce(counts, ... ), which eventually fails with [ RuntimeError: Socket Timeout ].
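
The failure mode can be reproduced in miniature without Megatron or PyTorch. The sketch below is only an analogy built on Python's standard library: threads stand in for ranks, `time.sleep` stands in for `helpers.build_mapping(..)`, and a `threading.Barrier` with a timeout stands in for the collective `all_reduce`. The names `run_rank` and `simulate` are hypothetical and do not exist in Megatron.

```python
import threading
import time


def run_rank(rank, barrier, build_seconds, results):
    """Simulate one rank: only rank 0 builds, then every rank synchronizes."""
    if rank == 0:
        # Stand-in for the long-running helpers.build_mapping(..) call.
        time.sleep(build_seconds)
    try:
        # Stand-in for the collective call that every rank must reach.
        barrier.wait()
        results[rank] = "ok"
    except threading.BrokenBarrierError:
        # Raised when any waiter exceeds the barrier timeout.
        results[rank] = "timeout"


def simulate(world_size, build_seconds, timeout_seconds):
    """Run all simulated ranks and return a {rank: status} dict."""
    barrier = threading.Barrier(world_size, timeout=timeout_seconds)
    results = {}
    threads = [
        threading.Thread(target=run_rank, args=(r, barrier, build_seconds, results))
        for r in range(world_size)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    # Build takes longer than the timeout: the waiting ranks give up first.
    print(simulate(world_size=3, build_seconds=0.5, timeout_seconds=0.1))
    # Build finishes within the timeout: every rank passes the barrier.
    print(simulate(world_size=3, build_seconds=0.05, timeout_seconds=2.0))
```

When the simulated build outlasts the timeout, the waiting ranks break the barrier and rank 0 fails as soon as it arrives, which mirrors the report above. In a real PyTorch job the comparable knob is the `timeout` argument of `torch.distributed.init_process_group`; whether raising it is the right fix here, as opposed to making the mapping build faster, is a separate question.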