Small changes to net that make grad accumulation easier #516
Conversation
Hi Benjamin, it is also necessary to divide the loss by the number of accumulation steps to correct for the effective batch size. That is, if we want an effective batch size of 8 using an instantaneous batch size of 2 (4 steps of gradient accumulation), the returned loss per step will be the mean loss over only 2 samples instead of 8, so we have to divide it by 4. I think this correction can be performed by a Callback that divides the gradients directly inside
`loss.backward()` needs to be called for each batch.
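To make the normalization point concrete, here is a minimal, self-contained sketch of gradient accumulation in plain PyTorch (not skorch); the toy model, data, and hyperparameters are made up for illustration:

```python
import torch
from torch import nn

ACC_STEPS = 4                       # steps to accumulate before updating
model = nn.Linear(10, 2)            # toy model, just for illustration
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

X = torch.randn(8, 10)              # one "effective" batch of 8 samples
y = torch.randint(0, 2, (8,))

optimizer.zero_grad()
for i in range(ACC_STEPS):          # 4 instantaneous batches of size 2
    Xi, yi = X[2 * i:2 * i + 2], y[2 * i:2 * i + 2]
    loss = criterion(model(Xi), yi)
    # divide so that the summed gradients equal the gradient of the mean
    # loss over all 8 samples
    (loss / ACC_STEPS).backward()
optimizer.step()
```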
@fabiocapsouza Thanks for your feedback. Indeed, my implementation was not quite right. Could you please have a look and check whether it is correct now? To implement the change, I had to sacrifice calling
This wouldn't be a direct test and thus not really helpful. I think with the current code it's clear enough that
I left this out at first, since this PR is not about actually implementing gradient accumulation, just about showing that implementing it is feasible. The normalization part shouldn't be the problem. Still, I have changed the test to divide the loss; is that correct?
The loss has to be divided before calling
I have seen #1 only (it is how it is done in
Good catch. Overall it looks good to me!
I changed the code to reflect this. To make this PR ready, I will probably add code similar to the test to the FAQ.
Test different accumulation step sizes, fix a bug in calculating expected number.
Regarding elevating to a parameter, this seems fairly popular to do:
- pytorch-lightning has a parameter
- fastai's callback system is flexible enough to have an AccumulateScheduler

Then again, accumulating gradients may go out of fashion in a few months.
Overall this PR looks good as a non-committal change.
docs/user/FAQ.rst (outdated)

```python
ACC_STEPS = 2  # number of steps to accumulate before updating weights

class GradAccNet(net_cls):
```
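For context, here is a hedged sketch of what the complete FAQ example might look like, based on the test and the discussion in this thread; the exact method bodies (the normalization in `get_loss` and the modulo check in `train_step`) are assumptions, not the final FAQ wording:

```python
from skorch import NeuralNetClassifier

net_cls = NeuralNetClassifier  # assumption: any NeuralNet subclass should work
ACC_STEPS = 2  # number of steps to accumulate before updating weights

class GradAccNet(net_cls):
    """Net that accumulates gradients over ACC_STEPS batches."""

    def __init__(self, *args, acc_steps=ACC_STEPS, **kwargs):
        super().__init__(*args, **kwargs)
        self.acc_steps = acc_steps

    def get_loss(self, *args, **kwargs):
        # normalize so that the accumulated gradient matches one big batch
        return super().get_loss(*args, **kwargs) / self.acc_steps

    def train_step(self, Xi, yi, **fit_params):
        """Accumulate gradients, stepping the optimizer only every
        acc_steps batches."""
        # number of batches seen in the current epoch, starting at 1
        n_train_batches = len(self.history[-1, 'batches'])
        step = self.train_step_single(Xi, yi, **fit_params)
        if n_train_batches % self.acc_steps == 0:
            self.optimizer_.step()
            self.optimizer_.zero_grad()
        return step
```

Used like `GradAccNet(MyModule, batch_size=2, acc_steps=4)`, this should behave roughly like training with an effective batch size of 8.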
Should we detail what `net_cls` is? Could this be `NeuralNetClassifier` to be slightly more concrete?
skorch/net.py (outdated)

```
@@ -730,10 +730,12 @@ def fit_loop(self, X, y=None, epochs=None, **fit_params):
    yi_res = yi if not y_train_is_ph else None
    self.notify('on_batch_begin', X=Xi, y=yi_res, training=True)
    step = self.train_step(Xi, yi, **fit_params)
    train_batch_count += 1
    if not step:
```
Does `train_step` return a falsy value for this to trigger?
What do you mean? That we should check for `if step is not None`?
In the examples in this PR that overwrite `train_step`, we return something.
Yes, you are right. This was a remnant of my earlier attempt. Fixed now.
I wouldn't go ahead and implement this just yet. As you said, maybe it goes out of fashion soon. Also, in contrast to at least fastai, skorch was never meant to deliver everything out of the box. Instead, it should be "hackable" enough that users can implement almost everything they want without too much hassle. I think this is achieved here.
LGTM
In #506, there was a discussion about how to implement gradient accumulation with skorch. Unfortunately, it seemed to me that it is not easily possible. However, I believe that a small number of tweaks are enough to make it possible:
- move `self.optimizer_.zero_grad()` from `train_step_single` to `train_step` (might break some code, but only of very advanced users)
- `step` can now be None for train in `fit_loop`; add a conditional to only record things if `step` is not None (ugly but not back breaking)

I hope that this actually does what I read about gradient accumulation, since I have never used it myself. Maybe @fabiocapsouza can comment on that.
Overall, if this should really be enough, I think the changes would be worth implementing, given that popular architectures like BERT use it. However, I wouldn't elevate it to "first class" in skorch (yet), i.e. make it possible to control via a parameter. Instead, I would probably document how to do something similar to the test (possibly in the FAQ).
PS: Not quite sure what to do with `train_batch_count`? Only increment it each time the optimizer is called, or for each batch? I guess the latter is more fitting?
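To illustrate the tweaks listed above, here is a hedged sketch of what the recording part of `fit_loop` could look like inside the training loop; the variable names follow the diff excerpt earlier in the thread, but the exact recording calls are an assumption, not the actual code:

```python
# sketch of the body of fit_loop's training loop, not the actual diff
step = self.train_step(Xi, yi, **fit_params)
train_batch_count += 1  # count every batch, whether or not the optimizer stepped
if step is not None:
    # only record batch-level results when train_step returned something
    self.history.record_batch('train_loss', step['loss'].item())
self.notify('on_batch_end', X=Xi, y=yi_res, training=True)
```

Counting every batch here (the "latter" option from the PS) would keep `train_batch_count` independent of how often the optimizer actually steps.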