
Support sync batch norm group size #534

Closed

Conversation

mannatsingh
Contributor

Summary:
Sync BatchNorm is an expensive operation when we synchronize across multiple nodes. This adds the option to specify the group size for the sync op. For a setup with 8 GPUs per node, a group size of 8 means the sync only happens intra-node (since ranks are aligned in multiples of 8).
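
To make the group-size semantics concrete, here is a minimal sketch (the `compute_sync_groups` helper is illustrative, not code from this PR) of how contiguous global ranks get partitioned into sync groups:

```python
# Illustrative sketch: with 8 GPUs per node and group_size=8, each group covers
# exactly one node, so BN statistics are only all-reduced intra-node.
def compute_sync_groups(world_size, group_size):
    assert world_size % group_size == 0, "group size must divide world size"
    return [
        list(range(start, start + group_size))
        for start in range(0, world_size, group_size)
    ]

# Example: 4 nodes x 8 GPUs
# compute_sync_groups(32, 8) -> [[0..7], [8..15], [16..23], [24..31]]
```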

This only works with Apex Sync BN. When I tried it with PyTorch Sync BN and a group size of 8, I got connection reset errors inside `init_distributed_data_parallel_model`; when the group size equaled the total number of GPUs, there were no issues.
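
For reference, a hedged sketch of how the group size could be wired through Apex: `create_syncbn_process_group` and `convert_syncbn_model` are real `apex.parallel` functions, but `apply_apex_sync_bn` and the surrounding glue are illustrative, not this PR's implementation.

```python
import torch.distributed as dist
from apex.parallel import convert_syncbn_model, create_syncbn_process_group

def apply_apex_sync_bn(model, group_size=None):
    # create_syncbn_process_group splits the world into contiguous groups of
    # `group_size` ranks (group_size must divide the world size) and returns
    # the process group containing the current rank.
    process_group = None
    if group_size is not None and group_size < dist.get_world_size():
        process_group = create_syncbn_process_group(group_size)
    # Swap BatchNorm layers for Apex SyncBatchNorm bound to that group;
    # with process_group=None the sync spans all ranks.
    return convert_syncbn_model(model, process_group=process_group)
```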

I think we have too many distributed-related options at the config level. I'll move them to a separate section in a follow-up diff.

Differential Revision: D21868629

fbshipit-source-id: 0983ae702329c716618407a332e67c00838a3b17
@facebook-github-bot added the CLA Signed and fb-exported labels on Jun 3, 2020
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D21868629

facebook-github-bot pushed a commit that referenced this pull request on Jun 4, 2020
Summary:
Pull Request resolved: #534

Reviewed By: vreis

Differential Revision: D21868629

fbshipit-source-id: e659d8339a2d03a6cb4be2eb9599223583191c2b
@mannatsingh deleted the export-D21868629 branch on June 29, 2020, 21:30