Summary:
Sync BatchNorm is an expensive operation when synchronizing across multiple nodes. This adds an option to specify the group size for the sync op. A value of 8 means the sync only happens intra-node for an 8-GPU-per-node setup, since ranks are assigned in consecutive blocks of 8 per node.
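As a rough illustration (not the actual ClassyVision implementation), the sketch below shows one way a sync-BN group size can be mapped onto process groups: ranks are partitioned into consecutive blocks of `group_size`, and the model's BatchNorm layers are converted to sync only within the block containing the current rank. The helper name `create_sync_bn_process_group` is hypothetical; `torch.distributed.new_group` and `apex.parallel.convert_syncbn_model` are the underlying APIs assumed here.

```python
import torch.distributed as dist


def create_sync_bn_process_group(group_size: int):
    """Partition ranks [0, world_size) into consecutive blocks of `group_size`
    and return the process group that contains the current rank.

    Note: every rank must call dist.new_group() for every block, including
    blocks it does not belong to, so all blocks are created in this loop.
    """
    world_size = dist.get_world_size()
    assert world_size % group_size == 0, "group_size must divide world_size"

    rank = dist.get_rank()
    my_group = None
    for start in range(0, world_size, group_size):
        ranks = list(range(start, start + group_size))
        group = dist.new_group(ranks=ranks)
        if rank in ranks:
            my_group = group
    return my_group


# Usage with Apex sync BN (the path reported as working in this diff):
#   from apex.parallel import convert_syncbn_model
#   process_group = create_sync_bn_process_group(group_size=8)
#   model = convert_syncbn_model(model, process_group=process_group)
```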
This only works with Apex Sync BN. When trying it with PyTorch sync BN and a group size of 8, I was getting connection-reset errors inside init_distributed_data_parallel_model. When the group size equaled the total number of GPUs, there were no issues.

I think we have too many options at the config level for distributed settings. Moving them to a separate section in a follow-up diff.
Differential Revision: D21868629