
Small changes to net that make grad accumulation easier #516


Merged
merged 10 commits into master from gradient-accumulation on Sep 16, 2019

Conversation

BenjaminBossan (Collaborator)

In #506, there was a discussion about how to implement gradient accumulation with skorch. Unfortunately, it did not seem to be easily possible. However, I believe that a small number of tweaks is enough to make it work:

  • move self.optimizer_.zero_grad() from train_step_single to train_step (might break some code, but only that of very advanced users)
  • since step can now be None during training in fit_loop, add a conditional so that things are only recorded if step is not None (ugly, but not backwards-breaking)

I hope that this actually does what I read about gradient accumulation, since I have never used it myself. Maybe @fabiocapsouza can comment on that.

Overall, if this really is enough, I think the changes would be worth implementing, given that popular architectures like BERT use gradient accumulation. However, I wouldn't elevate it to "first class" status in skorch (yet), i.e. make it controllable via a parameter. Instead, I would probably document how to do something similar to the test (possibly in the FAQ).

PS: I'm not quite sure what to do with train_batch_count. Should it only be incremented each time the optimizer is called, or for each batch? I guess the latter is more fitting.
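
To make the first tweak concrete, here is a simplified paraphrase (the optimizer closure, the step accumulator, and the on_grad_computed notification of the real methods are left out for brevity, so this is a sketch rather than the actual skorch source):

```python
from skorch import NeuralNet

class SketchNet(NeuralNet):
    # sketch: zero_grad no longer happens in train_step_single ...
    def train_step_single(self, Xi, yi, **fit_params):
        self.module_.train()
        y_pred = self.infer(Xi, **fit_params)
        loss = self.get_loss(y_pred, yi, X=Xi, training=True)
        loss.backward()
        return {'loss': loss, 'y_pred': y_pred}

    # ... but in train_step instead, so a subclass overriding train_step
    # can decide when gradients are reset and when the optimizer steps
    def train_step(self, Xi, yi, **fit_params):
        self.optimizer_.zero_grad()
        step = self.train_step_single(Xi, yi, **fit_params)
        self.optimizer_.step()
        return step
```

With zero_grad under the control of train_step, a subclass can skip the reset (and the optimizer step) for all but every n-th batch, which is what gradient accumulation needs.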

BenjaminBossan self-assigned this on Aug 25, 2019
@fabiocapsouza

Hi Benjamin,
I am quite new to Skorch, so please correct me if I get something wrong.
I agree that moving self.optimizer_.zero_grad() from train_step_single is necessary to allow accumulation of gradients of many backward steps.
However, it looks like GradAccNet.train_step is not performing loss.backward() on all training forward steps, since self.train_step_single will only be called every 2 batches (when self.optimizer.step(step_fn) is called). Is there a way to test this? Maybe attaching a callback and asserting that on_grad_computed was called b_total times would work?
It would be necessary to perform loss.backward() on all training batches while calling optimizer.step() only conditionally. I believe this requires calling step_fn() directly, but I am not sure whether that would break things.

It is also necessary to divide the loss by the number of accumulation steps to correct for the effective batch size. That is, if we want an effective batch size of 8 using an instantaneous batch size of 2 (4 steps of gradient accumulation), the loss returned per step will be the mean loss over only 2 samples instead of 8, so we have to divide it by 4. I think this correction can be performed by a Callback that divides the gradients directly inside on_grad_computed every 4 batches, before optimizer.step is called.
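
A minimal sketch of what such a callback could look like, assuming the gradients are rescaled on exactly those batches where the optimizer is about to step (ACC_STEPS and the callback name are made up for illustration):

```python
from skorch.callbacks import Callback

ACC_STEPS = 4  # number of batches whose gradients are accumulated

class ScaleAccumulatedGrads(Callback):
    def on_grad_computed(self, net, named_parameters, **kwargs):
        # the current batch is already recorded in the history, so this
        # count starts at 1 for the first batch of the epoch
        n_batches = len(net.history[-1, 'batches'])
        if n_batches % ACC_STEPS == 0:
            # rescale right before the optimizer step so that the update
            # corresponds to a batch that is ACC_STEPS times larger
            for _, param in named_parameters:
                if param.grad is not None:
                    param.grad /= ACC_STEPS
```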

loss.backward() needs to be called for each batch.
@BenjaminBossan (Collaborator, Author)

@fabiocapsouza Thanks for your feedback. Indeed, my implementation was not quite right. Could you please have a look to see whether it is correct now?

To implement the change, I had to sacrifice calling optimizer.step(step_fn) with the closure; I'm not sure whether the two are mutually exclusive.

> Is there a way to test this? Maybe attaching a callback and asserting that on_grad_computed was called b_total times would work?

This wouldn't be a direct test and thus not really helpful. I think with the current code it's clear enough that loss.backward() is called for each batch.

> It is also necessary to divide the loss by the number of accumulation steps to correct for the effective batch size.

I left this out at first since this PR is not about actually implementing gradient accumulation, only about showing that implementing it is feasible. The normalization part shouldn't be a problem. Still, I have changed the test to divide the loss; is that correct?
(What's also missing is that right now, if there is an uneven number of train batches, there should be an optimizer step for the last batch during the same epoch.)
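
For context on the closure: PyTorch optimizers accept an optional closure that re-evaluates the loss, and some optimizers (e.g. LBFGS) require it, while plain SGD-style optimizers work without it. A small, self-contained contrast in plain PyTorch (nothing skorch-specific):

```python
import torch

params = [torch.nn.Parameter(torch.ones(3))]

# closure form: the optimizer may call step_fn (and thus backward) itself,
# possibly several times per step -- this is what LBFGS relies on
opt_lbfgs = torch.optim.LBFGS(params)

def step_fn():
    opt_lbfgs.zero_grad()
    loss = (params[0] ** 2).sum()
    loss.backward()
    return loss

opt_lbfgs.step(step_fn)

# plain form: backward() has already been called, step() only applies the update
opt_sgd = torch.optim.SGD(params, lr=0.1)
opt_sgd.zero_grad()
(params[0] ** 2).sum().backward()
opt_sgd.step()
```

Since the closure zeroes and recomputes the gradients itself, it does not combine naturally with accumulating gradients across several batches, which is presumably why it had to be given up here.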

@fabiocapsouza

> Still, I have changed the test to divide the loss; is that correct?

The loss has to be divided before calling loss.backward() (so the produced gradients are smaller). I think there are two paths:

  1. Dividing the loss before backward() inside train_step_single or get_loss on every batch.
  2. Keeping the losses intact and adjusting the gradients before calling optimizer_.step().

I have only seen option 1 (it is how it is done in the pytorch_transformers library). In theory, option 2 should be equivalent; however, the loss history will be distorted. But as you have said, it is out of the scope of this PR :)
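
To make the arithmetic behind option 1 concrete, here is a small self-checking example in plain PyTorch (model and batch sizes are arbitrary): accumulating backward() on loss / 4 over four batches of 2 produces the same gradients as a single backward() on the mean loss over the full batch of 8.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(3, 1)
criterion = torch.nn.MSELoss()  # mean reduction by default
X, y = torch.randn(8, 3), torch.randn(8, 1)

# one big batch of 8
model.zero_grad()
criterion(model(X), y).backward()
grad_big = model.weight.grad.clone()

# four accumulation steps with batches of 2, each loss divided by 4
model.zero_grad()
for Xi, yi in zip(X.split(2), y.split(2)):
    (criterion(model(Xi), yi) / 4).backward()
grad_acc = model.weight.grad.clone()

assert torch.allclose(grad_big, grad_acc, atol=1e-6)
```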

> (What's also missing is that right now, if there is an uneven number of train batches, there should be an optimizer step for the last batch during the same epoch.)

Good catch.

Overall it looks good to me!

@BenjaminBossan (Collaborator, Author)

> 1. Dividing the loss before `backward()` inside `train_step_single` or `get_loss` on every batch.

I changed the code to reflect this.

To make this PR ready, I will probably add code similar to the test to the FAQ.
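
For reference, a hedged sketch of what such FAQ code could look like; GradAccNet, acc_steps, and n_accumulated_ are illustrative names, this is not necessarily the exact code from the test, and the flush in on_epoch_end addresses the leftover batches mentioned above:

```python
from skorch import NeuralNetClassifier

class GradAccNet(NeuralNetClassifier):
    """Sketch of a net that accumulates gradients over acc_steps batches."""

    def __init__(self, *args, acc_steps=2, **kwargs):
        super().__init__(*args, **kwargs)
        self.acc_steps = acc_steps

    def get_loss(self, *args, **kwargs):
        # normalize so that the accumulated update matches a batch that is
        # acc_steps times larger (note: this also scales the recorded losses)
        return super().get_loss(*args, **kwargs) / self.acc_steps

    def on_epoch_begin(self, net, **kwargs):
        super().on_epoch_begin(net, **kwargs)
        self.n_accumulated_ = 0  # batches backpropagated but not yet applied

    def train_step(self, Xi, yi, **fit_params):
        # backward pass on every batch; zero_grad is no longer called here
        step = self.train_step_single(Xi, yi, **fit_params)
        self.n_accumulated_ += 1
        if self.n_accumulated_ == self.acc_steps:
            self.optimizer_.step()       # update only every acc_steps batches
            self.optimizer_.zero_grad()
            self.n_accumulated_ = 0
        return step

    def on_epoch_end(self, net, **kwargs):
        # flush leftover gradients when the number of train batches is not
        # a multiple of acc_steps
        if self.n_accumulated_:
            self.optimizer_.step()
            self.optimizer_.zero_grad()
            self.n_accumulated_ = 0
        super().on_epoch_end(net, **kwargs)
```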

BenjaminBossan marked this pull request as ready for review on August 31, 2019
@thomasjpfan (Member) left a comment

Regarding elevating to a parameter, this seems fairly popular to do:

  1. pytorch-lightning has a parameter for it
  2. fastai's callback system is flexible enough to have an AccumulateScheduler

That said, accumulating gradients may go out of fashion within months.

Overall this PR looks good as a non-committal change.


ACC_STEPS = 2 # number of steps to accumulate before updating weights

class GradAccNet(net_cls):
Member

Should we detail what net_cls is? Could this be NeuralNetClassifier to be slightly more concrete?

skorch/net.py Outdated
@@ -730,10 +730,12 @@ def fit_loop(self, X, y=None, epochs=None, **fit_params):
yi_res = yi if not y_train_is_ph else None
self.notify('on_batch_begin', X=Xi, y=yi_res, training=True)
step = self.train_step(Xi, yi, **fit_params)
train_batch_count += 1
if not step:
Member

Does train_step return a falsy value for this to trigger?

Collaborator Author

What do you mean? That we should check for `step is not None`?

Member

In the examples in this PR that override train_step, we do return something.

Collaborator Author

Yes, you are right. This was a remnant of my earlier attempt. Fixed now.

@BenjaminBossan (Collaborator, Author)

> Regarding elevating to a parameter, this seems fairly popular to do:

I wouldn't go ahead and implement this just yet. As you said, maybe it will go out of fashion soon. Also, in contrast to fastai at least, skorch was never meant to deliver everything out of the box. Instead, it should be "hackable" enough that users can implement almost everything they want without too much hassle. I think that is achieved here.

@thomasjpfan (Member) left a comment

LGTM

ottonemo merged commit a62e419 into master on Sep 16, 2019
BenjaminBossan deleted the gradient-accumulation branch on September 22, 2019