Recommended way to implement gradient accumulation #506
I didn't know about gradient accumulation, so I don't have any code for you. That being said, the number of training batches can be inferred from the history. Regarding running the optimizer only every N batches, I don't know enough about the topic to give you a solution. But if you tinker a bit and find a working solution, please post it here. Maybe we can then point you towards improving it.
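For context, the batch count mentioned above can be read from skorch's `History` object. A minimal sketch, assuming `net` is a `NeuralNet` instance that is being (or has been) trained; the variable name is just for illustration:

```python
# The history stores one record per batch for each epoch; indexing with
# (-1, 'batches') returns the batch records of the most recent epoch.
n_train_batches = len(net.history[-1, 'batches'])
```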
@fabiocapsouza Did you make progress on this?
@BenjaminBossan, thanks for your first reply and sorry for the delay. Thanks again for your quick response, it really helped me.
I think it will be worth exploring whether we can refactor skorch to make this easier. Until then, you are probably better off using your current solution. If you want to try skorch again at some point and maybe help improve it, feel free to reopen this issue or start a new one.
Hi,

I am new to skorch and I am implementing BERT fine-tuning using skorch. One of the features that is missing is gradient accumulation, where `loss.backward()` is called on every batch but the optimizer is stepped only after N consecutive batches. I believe I have to override `train_step_single` to remove the `zero_grad()` call before every step, and `train_step` to perform `optimizer_.step()` conditionally. Also, the net has to keep track of the number of batches during training and keep a state that can be accessed inside `train_step`. Where is the recommended place to save this kind of state?

Thanks,
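For readers landing on this issue, here is a minimal sketch of the approach described above: let `train_step_single` keep calling `loss.backward()`, and only step (and zero) the optimizer every N batches from an overridden `train_step`. The class name `AccumulateNet`, the `acc_steps` argument, and the loss scaling in `get_loss` are illustrative assumptions rather than an official skorch API, and the exact `train_step_single` signature may differ between skorch versions:

```python
from skorch import NeuralNetClassifier


class AccumulateNet(NeuralNetClassifier):
    """Sketch of a net that steps the optimizer only every `acc_steps` batches."""

    def __init__(self, *args, acc_steps=2, **kwargs):
        super().__init__(*args, **kwargs)
        self.acc_steps = acc_steps

    def get_loss(self, *args, **kwargs):
        # Scale the loss so the accumulated gradient roughly matches the
        # gradient of one large batch (a modeling choice, not required).
        return super().get_loss(*args, **kwargs) / self.acc_steps

    def train_step(self, batch, **fit_params):
        # A record for the current batch already exists in the history,
        # so this count starts at 1 within each epoch.
        n_train_batches = len(self.history[-1, 'batches'])

        # train_step_single performs the forward pass and loss.backward()
        # but does not zero the gradients, so they accumulate across calls.
        step = self.train_step_single(batch, **fit_params)

        if n_train_batches % self.acc_steps == 0:
            self.optimizer_.step()
            self.optimizer_.zero_grad()
        return step
```

Usage would then look like `net = AccumulateNet(MyBertModule, acc_steps=4, ...)` (the module name is a placeholder). Only `train_step` is overridden, so validation steps are unaffected.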