Recommended way to implement gradient accumulation #506
I didn't know about gradient accumulation, so I don't have any code for you. That being said, the number of training batches can be inferred from the history. Regarding running the optimizer only every N batches, I don't know enough about the topic to give you a solution. But if you tinker a bit and find a working solution, please post it here. Maybe we can then point you towards improving it.
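For context, the batch count mentioned above can be read from skorch's `History` object. A minimal sketch, assuming `net` is a `NeuralNet` instance that is being (or has been) trained; the variable name is just for illustration:

```python
# The history stores one record per batch for each epoch; indexing with
# (-1, 'batches') returns the batch records of the most recent epoch.
n_train_batches = len(net.history[-1, 'batches'])
```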
@fabiocapsouza Did you make progress on this?
@BenjaminBossan, thanks for your first reply and sorry for the delay. Thanks again for your quick response, it really helped me.
I think it will be worth exploring whether we can refactor skorch to make this easier. Until then, you are probably better off using your current solution. If you want to try skorch again at some point and maybe help improve it, feel free to reopen this issue or start a new one.
Hi,

I am new to skorch and I am implementing BERT fine-tuning using skorch. One of the features that is missing is gradient accumulation, where `loss.backward()` is called on every batch but the optimizer is stepped only after N consecutive batches. I believe I have to override `train_step_single` to remove the `zero_grad()` call before every step, and `train_step` to perform `optimizer_.step()` conditionally. Also, the net has to keep track of the number of batches during training and keep a state that can be accessed inside `train_step`. Where is the recommended place to save this kind of state?

Thanks,
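For readers landing on this issue, here is a minimal sketch of the approach described above: let `train_step_single` keep calling `loss.backward()`, and only step (and zero) the optimizer every N batches from an overridden `train_step`. The class name `AccumulateNet`, the `acc_steps` argument, and the loss scaling in `get_loss` are illustrative assumptions rather than an official skorch API, and the exact `train_step_single` signature may differ between skorch versions:

```python
from skorch import NeuralNetClassifier


class AccumulateNet(NeuralNetClassifier):
    """Sketch of a net that steps the optimizer only every `acc_steps` batches."""

    def __init__(self, *args, acc_steps=2, **kwargs):
        super().__init__(*args, **kwargs)
        self.acc_steps = acc_steps

    def get_loss(self, *args, **kwargs):
        # Scale the loss so the accumulated gradient roughly matches the
        # gradient of one large batch (a modeling choice, not required).
        return super().get_loss(*args, **kwargs) / self.acc_steps

    def train_step(self, batch, **fit_params):
        # A record for the current batch already exists in the history,
        # so this count starts at 1 within each epoch.
        n_train_batches = len(self.history[-1, 'batches'])

        # train_step_single performs the forward pass and loss.backward()
        # but does not zero the gradients, so they accumulate across calls.
        step = self.train_step_single(batch, **fit_params)

        if n_train_batches % self.acc_steps == 0:
            self.optimizer_.step()
            self.optimizer_.zero_grad()
        return step
```

Usage would then look like `net = AccumulateNet(MyBertModule, acc_steps=4, ...)` (the module name is a placeholder). Only `train_step` is overridden, so validation steps are unaffected.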