
Remove scattering for multi-GPU training. #2200

Merged · 87 commits into allenai:master · Jan 18, 2019

Conversation

@brendan-ai2 (Contributor) commented Dec 18, 2018

  • Instead just pull off a batch for each GPU.
  • Enables increasing the effective batch size for bidirectional_language_model.jsonnet by 2x, giving a 1.5x speedup.
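
For context, a rough sketch of the approach, assuming the iterator yields a `batch_group` containing one batch per GPU and a model that returns a dict with a `'loss'` key. The function and helper names here are illustrative, not the trainer's actual code:

```python
from torch.nn.parallel import replicate, parallel_apply, gather

def to_device(batch, device):
    # Minimal helper for a flat dict of tensors; real batches may be nested.
    return {name: tensor.cuda(device) for name, tensor in batch.items()}

def forward_on_multiple_gpus(model, batch_group, cuda_devices):
    # One full batch per GPU, instead of scattering a single large batch.
    moved = [to_device(batch, device)
             for batch, device in zip(batch_group, cuda_devices)]
    used_device_ids = cuda_devices[:len(moved)]

    # Models are called with keyword arguments only, so the positional
    # inputs are empty tuples (see the review thread further down).
    inputs = [()] * len(moved)
    replicas = replicate(model, used_device_ids)
    outputs = parallel_apply(replicas, inputs, moved, used_device_ids)

    # Average the per-replica losses on the first device.
    losses = gather([output['loss'].unsqueeze(0) for output in outputs],
                    used_device_ids[0], dim=0)
    return {'loss': losses.mean()}
```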

@@ -34,7 +34,7 @@ local BASE_ITERATOR = {
     // samples in every batch.
     "batch_size": 512 * NUM_GPUS,
     "sorting_keys": [["source", "num_tokens"]],
-    "maximum_samples_per_batch": ["num_tokens", NUM_GPUS * 1000]
+    "maximum_samples_per_batch": ["num_tokens", 2000]
@brendan-ai2 (Author):

There's a minor backwards compatibility issue here. We're effectively multiplying the batch size (for multi-GPU users) by the number of GPUs. In practice this will result in some OOMs for users who were running close to their memory limits. Given that we had an experimental warning for that use case, I think this is okay, but I'm curious whether you have other thoughts.

Contributor:

This seems fine to me, too.

@brendan-ai2 (Author):

Thanks.

@brendan-ai2 (Author) commented:

FYI @vidurj, you should be able to merge this down if you need it ASAP.

@sai-prasanna (Contributor) commented Jan 17, 2019

@brendan-ai2 Do you get better multi-GPU utilization with this method? For a sequence-to-sequence model, neither the current implementation nor this one achieves full utilization for me. I use fairseq models imported into allennlp.

Using fairseq directly, however, makes training faster as expected when using bigger batch sizes. Fairseq uses distributed data parallel training with multiprocessing. I don't know what the bottlenecks are in the DataParallel approach we currently use. I suspect the GIL: even though the operations run on CUDA, the instructions are issued from Python, which could make the GIL a bottleneck for models like the fairseq CNN that have a lot of Python code.

Even the torch docs state that this might be the case: https://pytorch.org/docs/stable/distributed.html

In the single-machine synchronous case, torch.distributed or the torch.nn.parallel.DistributedDataParallel() wrapper may still have advantages over other approaches to data-parallelism, including torch.nn.DataParallel():

  • Each process maintains its own optimizer and performs a complete optimization step with each iteration. While this may appear redundant, since the gradients have already been gathered together and averaged across processes and are thus the same for every process, this means that no parameter broadcast step is needed, reducing time spent transferring tensors between nodes.
  • Each process contains an independent Python interpreter, eliminating the extra interpreter overhead and “GIL-thrashing” that comes from driving several execution threads, model replicas, or GPUs from a single Python process. This is especially important for models that make heavy use of the Python runtime, including models with recurrent layers or many small components.
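
For concreteness, here is a minimal sketch of the one-process-per-GPU setup those docs describe. It is not AllenNLP code, and the backend, address, and launch mechanics are assumptions:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

def make_ddp_model(model, rank, world_size):
    # One Python process per GPU, each with its own interpreter (no shared GIL).
    dist.init_process_group(backend="nccl",
                            init_method="tcp://127.0.0.1:23456",  # assumed address/port
                            rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    model = model.cuda(rank)
    # Gradients are all-reduced during backward(), so each process runs its
    # own optimizer step and no separate parameter broadcast is needed.
    return DistributedDataParallel(model, device_ids=[rank])
```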

@matt-gardner (Contributor) left a comment:

I had one minor question, but other than that the code looks fine to me, and better than what was here previously. There's still the issue that @sai-prasanna brings up, but figuring out how to make this faster should be a separate issue (and one that I have no experience with).

Oh, we had a custom scatter_kwargs function, right? That should be deleted now, shouldn't it? It was broken anyway: the one thing we made it custom for, instead of just using pytorch's version, didn't actually work.


used_device_ids = cuda_devices[:len(inputs)]
inputs = [()] * len(batch_group)
Contributor:

inputs is supposed to be a list of empty tuples? This never gets updated before getting passed to parallel_apply.

@brendan-ai2 (Author):

Added a comment to clarify. You can see that () was passed to the old scatter_kwargs as well.

# We pass all our arguments as kwargs. Create a list of empty tuples of the
# correct shape to serve as (non-existent) positional arguments.
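
For readers unfamiliar with parallel_apply: it takes a list of positional-argument tuples plus an optional list of per-replica kwargs dicts. A minimal, self-contained sketch of how the empty tuples line up with the kwargs (the Doubler module and the two-GPU device list are made up for illustration):

```python
import torch
from torch.nn.parallel import replicate, parallel_apply

class Doubler(torch.nn.Module):
    # Toy module that, like AllenNLP models, takes keyword arguments only.
    def forward(self, *, x):
        return {'value': x * 2}

# Assumes at least two visible GPUs.
devices = [0, 1]
model = Doubler().cuda(devices[0])
replicas = replicate(model, devices)

# All real arguments travel as kwargs, one dict per replica; the positional
# "inputs" are just empty tuples of matching length.
kwargs = [{'x': torch.ones(2, device=f'cuda:{d}')} for d in devices]
inputs = [()] * len(kwargs)
outputs = parallel_apply(replicas, inputs, kwargs, devices)
# outputs[i]['value'] lives on cuda:i
```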


@matt-peters (Contributor) commented:

@sai-prasanna good ideas, thanks for the input. Can you provide some more details about how you integrate allennlp with fairseq? How do you train the model -- with fairseq or the allennlp Trainer?

@brendan-ai2 (Author) commented:

@sai-prasanna, the main benefit of this PR is that it allows one to have larger batches (and thus train faster). For reasons I don't entirely understand, our scatter_kwargs implementation seemed to result in decidedly unbalanced GPU memory usage. Utilization seemed marginally better with this change, but I didn't look at that closely. In general I don't think we can promise full utilization, so we'll need to look at things in more detail if that's a major issue for you.

Your points about using torch.distributed are well taken! :) We'd definitely like to investigate that more, but that's out of scope for this PR. Would you be willing to open an issue with your insights and/or requests?

@brendan-ai2 (Author) commented:

@matt-gardner, thanks for the review! I can delete scatter_kwargs, but is it worth deprecating first? It's not clear to me whether it was strictly internal.

@matt-gardner (Contributor) commented:

Re: deleting the method, it was part of experimental behavior. We added the method (instead of using pytorch's version) for one purpose (to handle complex data structures) for which it didn't actually work, as evidenced by my failing test (I guess it worked for Metadata, but not more complex stuff?). So if anyone was using it externally (which seems very unlikely), it was broken for them too. I'd say to just remove it. And with it, we can probably also remove the ScatterableList.

@sai-prasanna (Contributor) commented Jan 18, 2019

@matt-gardner Yeah, it should be a separate issue. I thought this commit had an effect on performance.

@matt-peters I am using an allennlp model and trainer (https://gist.github.com/sai-prasanna/9b02b282894a3b01647c8704dc28b013), which has poor performance. I tested it on two different multi-GPU machines, each with 3 1080Tis.

Our team separately compared against fairseq's default trainers. Fairseq uses its own training flow, where dataset-to-token-index preprocessing happens first and the indices are then used to form tensors directly for training. I controlled for that affecting the speed by using the multiprocess data iterator in allennlp.

The performance difference is stark: fairseq gets better performance (1.4-1.5x) on two GPUs, but the allennlp trainer is slower than single-GPU training.

We will try making the allennlp trainer use distributed data parallel right away if the changes are simple, and test out the performance.

@matt-gardner (Contributor) commented Jan 18, 2019

@sai-prasanna, if you can figure out ways to make our multi-GPU code work faster, the help would be greatly appreciated. We're a very small team, and we have a lot of other things to focus on. Any help diagnosing particular issues or giving recommendations (or PRs!) on how to make things faster would be great.

@brendan-ai2 (Author) commented:

@sai-prasanna, an important point of clarification: this PR does have an effect on runtime performance. For the language modeling task described in training_config/bidirectional_language_model.jsonnet it effectively increased the batch size by 2x. This led to epochs taking only 67% of the time they did previously, i.e. a 1.5x speedup.

Of course, I can't guarantee every model will see such improvements, but it might be worth double checking your batch size (or maximum_samples_per_batch if you're using that) to see if it can be increased.

Thanks again for the feedback!

@brendan-ai2 (Author) commented:

Thanks for the review, @matt-gardner! (ScatterableList and friends deleted as requested.)

@brendan-ai2 merged commit 7525c61 into allenai:master on Jan 18, 2019