- Set num_workers for the DataLoader (done); otherwise the data loader loads data synchronously
- Don't use zero_grad; instead set param.grad to None (done? seems to be the new default)
- Can try running this on the code and see if it helps with what's slowing everything down: torch.backends.cudnn.benchmark = True (from an Nvidia presentation on speedups)
- Accumulate gradients (see "Training Neural Nets on Larger Batches: Practical Tips for 1-GPU, Multi-GPU & Distributed setups")
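The first few tips can be sketched together in one minimal training setup. This is a sketch, not the actual training code: the model, dataset, and optimizer below are toy placeholders.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Let cuDNN auto-tune conv kernels for fixed input shapes (the Nvidia tip).
torch.backends.cudnn.benchmark = True

# Toy stand-ins so the sketch runs; replace with the real dataset/model.
train_ds = TensorDataset(torch.randn(64, 8), torch.randn(64, 1))
model = torch.nn.Linear(8, 1)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

# num_workers > 0 loads batches in background workers instead of synchronously.
loader = DataLoader(train_ds, batch_size=16, num_workers=2)

for x, y in loader:
    # set_to_none=True frees .grad tensors instead of zero-filling them
    # (the default in recent PyTorch versions).
    opt.zero_grad(set_to_none=True)
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()
```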
Baseline: 35 sec, which means it would take 36 hours to train one of the models, and almost a week to train 4 models.
-> torch.backends.cudnn.benchmark = True: seems to be unchanged.
-> Updated to torch 2: slowed down (10 sec-ish).
-> Added torch.compile and checked the PyTorch version: minor speedup.
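A minimal sketch of the torch.compile step, assuming PyTorch 2.x and a placeholder model. The default backend is "inductor"; "eager" is used here only to keep the sketch free of compiler dependencies.

```python
import torch

# Placeholder model; in practice, compile the real network.
model = torch.nn.Sequential(
    torch.nn.Linear(8, 8), torch.nn.ReLU(), torch.nn.Linear(8, 1)
)

# torch.compile (PyTorch 2.x) JIT-compiles the forward pass; the first call
# pays compilation cost, later calls can be faster.
compiled = torch.compile(model, backend="eager")

x = torch.randn(4, 8)
y = compiled(x)  # same outputs as model(x)
```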
-> Switched to torch.no_grad in the test function: no big change.
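For reference, a minimal sketch of the no_grad change; the evaluate function and model are placeholders.

```python
import torch

model = torch.nn.Linear(8, 1)

@torch.no_grad()  # disables autograd tracking: less memory, slightly faster
def evaluate(batch):
    return model(batch)

out = evaluate(torch.randn(4, 8))
# Outputs carry no autograd graph inside no_grad.
assert not out.requires_grad
```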
-> set_to_none=True in the zero_grad call: no big change.
-> Accumulate gradients (no big change)
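A minimal sketch of the gradient-accumulation pattern from the "Larger Batches" article; model, data, and accum_steps are placeholders.

```python
import torch

model = torch.nn.Linear(8, 1)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
accum_steps = 4  # effective batch = loader batch size * accum_steps

opt.zero_grad(set_to_none=True)
for step in range(8):
    x, y = torch.randn(16, 8), torch.randn(16, 1)  # stand-in batch
    loss = torch.nn.functional.mse_loss(model(x), y)
    # Scale so the accumulated gradient matches one big-batch gradient.
    (loss / accum_steps).backward()
    if (step + 1) % accum_steps == 0:
        opt.step()
        opt.zero_grad(set_to_none=True)
```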
-> Set the default device: did not work well.
-> Checked whether both train and test are slow:
train: 19.764665842056274 s
testing: 9.0661141872406 s
-> Profiled time spent in forward: it's only about 3 seconds out of those 29 seconds. Hm.
-> Writing a custom dataloader that does the transform and loads everything into memory.
Uses 75 GB for train + test when starting.
-> It actually seems to slow down the training loop 4x. What?
-> The same is true for the testing loop. Weird.
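A minimal sketch of the load-everything-into-memory idea: apply the transform once up front so __getitem__ does no per-item work. The dataset, shapes, and transform here are toy placeholders, not the real data.

```python
import torch
from torch.utils.data import Dataset

class InMemoryDataset(Dataset):
    """Applies the transform once at construction, keeps tensors in RAM."""

    def __init__(self, images, labels, transform=None):
        if transform is not None:
            images = torch.stack([transform(img) for img in images])
        self.images, self.labels = images, labels

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        # No per-item transform left: just index into preloaded tensors.
        return self.images[idx], self.labels[idx]

# Toy data standing in for the real train/test sets.
imgs = torch.randn(10, 3, 64, 64)
ds = InMemoryDataset(imgs, torch.zeros(10, dtype=torch.long),
                     transform=lambda t: t * 2)
```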
-> Added a profiler to the loop and tried to find the bug. It looks like what takes the most time is moving data to the device and computing the optimizer step.
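One way to get this breakdown is torch.profiler; a sketch with a placeholder model and random batches, CPU-only so it runs anywhere.

```python
import torch
from torch.profiler import ProfilerActivity, profile

model = torch.nn.Linear(8, 1)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    for _ in range(5):
        x, y = torch.randn(16, 8), torch.randn(16, 1)
        opt.zero_grad(set_to_none=True)
        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()

# Table of ops sorted by total time: shows whether data movement or the
# optimizer step dominates.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```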
-> Huh: halving the batch_size made the move-to-device go from 16 sec to 3 sec.
-> However, inference and backward suddenly went up.
-> Resizing the images down to 32x32 and then increasing the batch_size helped. OpenAI used 128 as the batch size, though.
-> Try adjusting to 16-bit precision?
-> Try running python -m torch.utils.bottleneck.
-> Is it the transforms that take time? https://discuss.pytorch.org/t/how-improve-the-speed-of-training-after-using-transforms-resize/95994 seems relevant
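The 16-bit idea would typically be automatic mixed precision: autocast runs eligible ops in half precision and GradScaler keeps fp16 gradients from underflowing. A sketch with a placeholder model, guarded so it degrades to plain fp32 on CPU.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(8, 1).to(device)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

# GradScaler rescales the loss so fp16 gradients don't underflow;
# with enabled=False (CPU fallback) it is a transparent no-op.
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

x = torch.randn(16, 8, device=device)
y = torch.randn(16, 1, device=device)

opt.zero_grad(set_to_none=True)
# autocast runs eligible ops in half precision on CUDA.
with torch.autocast(device_type=device, enabled=(device == "cuda")):
    loss = torch.nn.functional.mse_loss(model(x), y)
scaler.scale(loss).backward()
scaler.step(opt)
scaler.update()
```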