
Support sync batch norm group size #534

Closed

Conversation

mannatsingh
Contributor

Summary:
Sync BatchNorm is an expensive operation when we synchronize across multiple nodes. This adds the option to specify the group size for the sync op. For a setup with 8 GPUs per node, a group size of 8 means the sync only happens intra-node (since ranks are aligned in multiples of 8).
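
To make the group-size semantics concrete, here is a minimal sketch (the `compute_sync_groups` helper is illustrative, not code from this PR) of how contiguous global ranks get partitioned into sync groups:

```python
# Illustrative sketch: with 8 GPUs per node and group_size=8, each group covers
# exactly one node, so BN statistics are only all-reduced intra-node.
def compute_sync_groups(world_size, group_size):
    assert world_size % group_size == 0, "group size must divide world size"
    return [
        list(range(start, start + group_size))
        for start in range(0, world_size, group_size)
    ]

# Example: 4 nodes x 8 GPUs
# compute_sync_groups(32, 8) -> [[0..7], [8..15], [16..23], [24..31]]
```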

This only works with Apex Sync BN. When I tried it with PyTorch Sync BN and a group size of 8, I got connection reset errors inside `init_distributed_data_parallel_model`; when the group size equaled the total number of GPUs, there were no issues.
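
For reference, a hedged sketch of how the group size could be wired through Apex: `create_syncbn_process_group` and `convert_syncbn_model` are real `apex.parallel` functions, but `apply_apex_sync_bn` and the surrounding glue are illustrative, not this PR's implementation.

```python
import torch.distributed as dist
from apex.parallel import convert_syncbn_model, create_syncbn_process_group

def apply_apex_sync_bn(model, group_size=None):
    # create_syncbn_process_group splits the world into contiguous groups of
    # `group_size` ranks (group_size must divide the world size) and returns
    # the process group containing the current rank.
    process_group = None
    if group_size is not None and group_size < dist.get_world_size():
        process_group = create_syncbn_process_group(group_size)
    # Swap BatchNorm layers for Apex SyncBatchNorm bound to that group;
    # with process_group=None the sync spans all ranks.
    return convert_syncbn_model(model, process_group=process_group)
```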

I think we have too many distributed-related options at the config level. I'll move them to a separate section in a follow-up diff.

Differential Revision: D21868629

fbshipit-source-id: 0983ae702329c716618407a332e67c00838a3b17
@facebook-github-bot added the CLA Signed and fb-exported labels on Jun 3, 2020
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D21868629

facebook-github-bot pushed a commit that referenced this pull request on Jun 4, 2020
Summary:
Pull Request resolved: #534

Reviewed By: vreis

Differential Revision: D21868629

fbshipit-source-id: e659d8339a2d03a6cb4be2eb9599223583191c2b
@mannatsingh deleted the export-D21868629 branch on June 29, 2020, 21:30