Small changes to net that make grad accumulation easier #516
Conversation
Hi Benjamin, it is also necessary to divide the loss by the number of accumulation steps to correct for the effective batch size. That is, if we want an effective batch size of 8 using an instantaneous batch size of 2 (4 steps of gradient accumulation), the returned loss per step will be the mean loss over only 2 samples instead of 8, so we have to divide it by 4. I think this correction can be performed by a Callback that divides the gradients directly inside
`loss.backward()` needs to be called for each batch.
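To make the normalization point concrete, here is a minimal, self-contained sketch of gradient accumulation in plain PyTorch (not skorch); the toy model, data, and hyperparameters are made up for illustration:

```python
import torch
from torch import nn

ACC_STEPS = 4                       # steps to accumulate before updating
model = nn.Linear(10, 2)            # toy model, just for illustration
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

X = torch.randn(8, 10)              # one "effective" batch of 8 samples
y = torch.randint(0, 2, (8,))

optimizer.zero_grad()
for i in range(ACC_STEPS):          # 4 instantaneous batches of size 2
    Xi, yi = X[2 * i:2 * i + 2], y[2 * i:2 * i + 2]
    loss = criterion(model(Xi), yi)
    # divide so that the summed gradients equal the gradient of the mean
    # loss over all 8 samples
    (loss / ACC_STEPS).backward()
optimizer.step()
```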
@fabiocapsouza Thanks for your feedback. Indeed, my implementation was not quite right. Could you please have a look and check whether it is correct now? To implement the change, I had to sacrifice calling
This wouldn't be a direct test and thus not really helpful. I think with the current code it's clear enough that
I left this out at first, since this PR is not about actually implementing gradient accumulation, just about showing that implementing it is feasible. The normalization part shouldn't be the problem. Still, I have changed the test to divide the loss; is that correct?
The loss has to be divided before calling
I have seen #1 only (it is how it is done in
Good catch. Overall it looks good to me!
I changed the code to reflect this. To make this PR ready, I will probably add code similar to the test to the FAQ.
Test different accumulation step sizes, fix a bug in calculating expected number.
Regarding elevating to a parameter, this seems fairly popular to do:
- pytorch-lightning has a parameter
- fastai's callback system is flexible enough to have an AccumulateScheduler

Then again, accumulating gradients may go out of fashion in a few months.
Overall this PR looks good as a non-committal change.
docs/user/FAQ.rst (outdated)

```python
ACC_STEPS = 2  # number of steps to accumulate before updating weights

class GradAccNet(net_cls):
```
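For context, here is a hedged sketch of what the complete FAQ example might look like, based on the test and the discussion in this thread; the exact method bodies (the normalization in `get_loss` and the modulo check in `train_step`) are assumptions, not the final FAQ wording:

```python
from skorch import NeuralNetClassifier

net_cls = NeuralNetClassifier  # assumption: any NeuralNet subclass should work
ACC_STEPS = 2  # number of steps to accumulate before updating weights

class GradAccNet(net_cls):
    """Net that accumulates gradients over ACC_STEPS batches."""

    def __init__(self, *args, acc_steps=ACC_STEPS, **kwargs):
        super().__init__(*args, **kwargs)
        self.acc_steps = acc_steps

    def get_loss(self, *args, **kwargs):
        # normalize so that the accumulated gradient matches one big batch
        return super().get_loss(*args, **kwargs) / self.acc_steps

    def train_step(self, Xi, yi, **fit_params):
        """Accumulate gradients, stepping the optimizer only every
        acc_steps batches."""
        # number of batches seen in the current epoch, starting at 1
        n_train_batches = len(self.history[-1, 'batches'])
        step = self.train_step_single(Xi, yi, **fit_params)
        if n_train_batches % self.acc_steps == 0:
            self.optimizer_.step()
            self.optimizer_.zero_grad()
        return step
```

Used like `GradAccNet(MyModule, batch_size=2, acc_steps=4)`, this should behave roughly like training with an effective batch size of 8.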
Should we detail what `net_cls` is? Could this be `NeuralNetClassifier` to be slightly more concrete?
skorch/net.py (outdated)

```
@@ -730,10 +730,12 @@ def fit_loop(self, X, y=None, epochs=None, **fit_params):
    yi_res = yi if not y_train_is_ph else None
    self.notify('on_batch_begin', X=Xi, y=yi_res, training=True)
    step = self.train_step(Xi, yi, **fit_params)
    train_batch_count += 1
    if not step:
```
Does `train_step` return a falsy value for this to trigger?
What do you mean? That we should check for `if step is not None`?
In the examples in this PR that overwrite `train_step`, we return something.
Yes, you are right. This was a remnant of my earlier attempt. Fixed now.
I wouldn't go ahead and implement this just yet. As you said, maybe it goes out of fashion soon. Also, in contrast to at least fastai, skorch was never meant to deliver everything out of the box. Instead, it should be "hackable" enough that users can implement almost everything they want without too much hassle. I think this is achieved here.
LGTM
In #506, there was a discussion about how to implement gradient accumulation with skorch. Unfortunately, it seemed to me that it is not easily possible. However, I believe that a small number of tweaks are enough to make it possible:
- move `self.optimizer_.zero_grad()` from `train_step_single` to `train_step` (might break some code, but only of very advanced users)
- `step` can now be None for train in `fit_loop`; add a conditional to only record things if `step` is not None (ugly but not back breaking)

I hope that this actually does what I read about gradient accumulation, since I have never used it myself. Maybe @fabiocapsouza can comment on that.
Overall, if this should really be enough, I think the changes would be worth implementing, given that popular architectures like BERT use it. However, I wouldn't elevate it to "first class" in skorch (yet), i.e. make it possible to control via a parameter. Instead, I would probably document how to do something similar to the test (possibly in the FAQ).
PS: Not quite sure what to do with `train_batch_count`? Only increment it each time the optimizer is called, or for each batch? I guess the latter is more fitting?
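To illustrate the tweaks listed above, here is a hedged sketch of what the recording part of `fit_loop` could look like inside the training loop; the variable names follow the diff excerpt earlier in the thread, but the exact recording calls are an assumption, not the actual code:

```python
# sketch of the body of fit_loop's training loop, not the actual diff
step = self.train_step(Xi, yi, **fit_params)
train_batch_count += 1  # count every batch, whether or not the optimizer stepped
if step is not None:
    # only record batch-level results when train_step returned something
    self.history.record_batch('train_loss', step['loss'].item())
self.notify('on_batch_end', X=Xi, y=yi_res, training=True)
```

Counting every batch here (the "latter" option from the PS) would keep `train_batch_count` independent of how often the optimizer actually steps.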