Remove scattering for multi-GPU training. #2200
Conversation
- Configuration for training a transformer based bidirectional LM.
- Training ongoing with sampled loss currently at 3.8411.
- Minor fixes to CnnHighwayEncoder.
- LayerNorm was needed instead of MaskedLayerNorm.
- Log average batch size during training.
@@ -34,7 +34,7 @@ local BASE_ITERATOR = {
     // samples in every batch.
     "batch_size": 512 * NUM_GPUS,
     "sorting_keys": [["source", "num_tokens"]],
-    "maximum_samples_per_batch": ["num_tokens", NUM_GPUS * 1000]
+    "maximum_samples_per_batch": ["num_tokens", 2000]
There's a minor backwards compatibility issue here. We're effectively multiplying the batch size (for multi-GPU users) by the number of GPUs. In practice this will result in some OOMs for users that were running close to their memory limits. Given that we had an experimental warning for that use case I think this is okay, but I'm curious if you have other thoughts.
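(Rough illustration of the concern, not numbers from the PR itself: with 2 GPUs, a config that previously capped a scattered batch at 2000 tokens gave each GPU roughly 1000 tokens; with one batch per GPU, each GPU now gets the full 2000 tokens, so an unchanged config roughly doubles per-GPU memory use.)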
This seems fine to me, too.
Thanks.
fyi, @vidurj you should be able to merge this down if you need it ASAP.
@brendan-ai2 Do you get better multi-GPU utilization through this method? Using fairseq directly makes training faster, as expected, when using bigger batch sizes. Fairseq uses distributed data parallel with multiprocessing. I don't know what the bottlenecks are in the current DataParallel approach we use. I suspect the GIL: even though the operations are done in CUDA, the instructions come from Python, which might make the GIL a bottleneck for models like the fairseq CNN that have lots of code in Python. Even the torch docs state it might be the case. https://pytorch.org/docs/stable/distributed.html

"In the single-machine synchronous case, torch.distributed or the torch.nn.parallel.DistributedDataParallel() wrapper may still have advantages over other approaches to data-parallelism, including torch.nn.DataParallel():"
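For readers following this thread, here is a minimal, illustrative sketch (not code from this PR, AllenNLP, or fairseq) of the one-process-per-GPU DistributedDataParallel pattern the torch docs recommend; the model, data, and hyperparameters are placeholders, and it assumes the usual MASTER_ADDR/MASTER_PORT environment variables are set by the process launcher.

# Illustrative sketch only: one process per GPU with DistributedDataParallel,
# avoiding the single-process DataParallel scatter/GIL overhead discussed above.
# The model, data, and training loop are placeholders, not AllenNLP code.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

def train(local_rank: int, world_size: int) -> None:
    # Assumes MASTER_ADDR/MASTER_PORT are set by the launcher.
    dist.init_process_group(backend="nccl", rank=local_rank, world_size=world_size)
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(512, 512).cuda(local_rank)  # stand-in for a real model
    ddp_model = DistributedDataParallel(model, device_ids=[local_rank])
    optimizer = torch.optim.Adam(ddp_model.parameters())

    for _ in range(10):  # stand-in for a real data iterator
        batch = torch.randn(32, 512, device=local_rank)
        loss = ddp_model(batch).sum()
        optimizer.zero_grad()
        loss.backward()  # gradients are all-reduced across processes here
        optimizer.step()

    dist.destroy_process_group()

Each process owns a single GPU, so there is no scatter step and no cross-device work serialized behind one Python interpreter.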
I had one minor question, but other than that the code looks fine to me, and better than what was here previously. There's still the issue that @sai-prasanna brings up, but figuring out how to make this faster should be a separate issue (and one that I have no experience with).
Oh, we had a custom scatter_kwargs function, right? That should be deleted now, shouldn't it? It was broken, anyway - the one thing we made it custom for, instead of just using pytorch's version, didn't actually work.
allennlp/training/util.py
Outdated
used_device_ids = cuda_devices[:len(inputs)]
inputs = [()] * len(batch_group)
inputs is supposed to be a list of empty tuples? This never gets updated before getting passed to parallel_apply.
Added a comment to clarify. You can see that () was passed to the old scatter_kwargs as well.
# We pass all our arguments as kwargs. Create a list of empty tuples of the
# correct shape to serve as (non-existent) positional arguments.
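For context, a simplified sketch of the no-scatter pattern being discussed in this thread (illustrative only, not the exact code in allennlp/training/util.py); it assumes each entry of batch_group is a dict of tensors already moved to its corresponding device.

# Simplified, illustrative version of the no-scatter multi-GPU forward pass;
# the function and variable names are placeholders, not the exact AllenNLP code.
from torch.nn.parallel import parallel_apply, replicate

def multi_gpu_forward(batch_group, model, cuda_devices):
    # One pre-built batch per GPU, so nothing needs to be scattered.
    used_device_ids = cuda_devices[:len(batch_group)]
    replicas = replicate(model, used_device_ids)
    # We pass all our arguments as kwargs. Create a list of empty tuples of the
    # correct shape to serve as (non-existent) positional arguments.
    inputs = [()] * len(batch_group)
    # Each replica runs forward(**batch) for its own batch on its own device.
    outputs = parallel_apply(replicas, inputs, batch_group, used_device_ids)
    return outputs

Compared to DataParallel, the scatter step disappears because the iterator has already produced one appropriately sized batch per GPU.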
@sai-prasanna good ideas, thanks for the input. Can you provide some more details about how you integrate allennlp with fairseq? How do you train the model -- with fairseq or the allennlp Trainer?
@sai-prasanna, the main benefit of this PR is that it allows one to have larger batches (and thus train faster). For reasons I don't entirely understand our ... Your points about using ...
@matt-gardner, thanks for the review! I can delete scatter_kwargs.
Re: deleting the method, it was part of experimental behavior. We added the method (instead of using pytorch's version) for one purpose (to handle complex data structures) for which it didn't actually work, as evidenced by my failing test (I guess it worked for ...).
@matt-gardner Yeah, it should be a separate issue. I thought this commit had an effect on performance. @matt-peters I am using the allennlp model and trainer (https://gist.github.com/sai-prasanna/9b02b282894a3b01647c8704dc28b013), which has poor performance. I tested it on two different multi-GPU machines with 3 1080Tis. Our team compared this against using fairseq's default trainers separately. Fairseq uses its own training flow where dataset -> token idx preprocessing happens first; it then uses the indices to form tensors directly for training. I controlled for that affecting the speed by using the multiprocess data iterator in allennlp. The performance difference is stark: fairseq has better performance (1.4-1.5x) on two GPUs, but the allennlp trainer is slower than single-GPU training. We will try making the allennlp trainer use distributed data parallel immediately if the changes are simple, and test out the performance.
@sai-prasanna, if you can figure out ways to make our multi-GPU code work faster, the help would be greatly appreciated. We're a very small team, and we have a lot of other things to focus on. Any help diagnosing particular issues or giving recommendations (or PRs!) on how to make things faster would be great.
@sai-prasanna, an important point of clarification: This PR does have an effect on runtime performance. For the language modeling task described in bidirectional_language_model.jsonnet it increased our effective batch size by 2x, giving a 1.5x speedup. Of course, I can't guarantee every model will see such improvements, but it might be worth double checking your batch size (or maximum_samples_per_batch). Thanks again for the feedback!
Thanks for the review, @matt-gardner!
* Remove semantic parsing code
* Fix module imports, remove more tests
* More fixes
* fix predict test
* fix another test
* remove more docs
* last doc errors, i think...
* Remove unnecessary requirements
* Removing some fixture data that I missed earlier
* Fix merge conflicts with black
* More merge conflicts, pin to pytorch 1.2
* black merge conflicts
* Remove unidecode