  • Set num_workers for the DataLoader (done); by default the data loader loads batches synchronously in the training process.
  • Don't zero gradients the old way; set param.grad to None via zero_grad(set_to_none=True) instead (done? Seems to be the new default). See the sketch after this list.
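A minimal sketch of both tweaks, assuming a generic train_dataset and optimizer (names hypothetical):

```python
import torch
from torch.utils.data import DataLoader

# num_workers > 0 loads batches in background worker processes instead of
# synchronously in the training process; pin_memory speeds up GPU copies.
train_loader = DataLoader(
    train_dataset,      # hypothetical dataset
    batch_size=128,
    shuffle=True,
    num_workers=4,
    pin_memory=True,
)

# set_to_none=True frees the gradient tensors instead of zero-filling them;
# since PyTorch 2.0 this is the default for optimizer.zero_grad().
optimizer.zero_grad(set_to_none=True)
```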

Profiler tips

  • Can try running the profiler on the code and see if it helps figure out what slows everything down; a sketch follows.
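A minimal sketch of what that could look like with torch.profiler (train_loader and train_step are hypothetical):

```python
from torch.profiler import profile, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
) as prof:
    for step, (x, y) in enumerate(train_loader):
        if step >= 10:       # profile a handful of steps only
            break
        train_step(x, y)     # hypothetical forward/backward/optimizer step

# Show which ops dominate the wall-clock time.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```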

Nvidia presentation of speedups

torch.backends.cudnn.benchmark = True
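For context: this flag makes cuDNN benchmark candidate convolution algorithms on the first batches and cache the fastest one, so it mainly helps when input shapes stay fixed across iterations.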

Training Neural Nets on Larger Batches: Practical Tips for 1-GPU, Multi-GPU & Distributed setups

  • Accumulate gradients (see the sketch below).
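A minimal sketch of gradient accumulation, assuming hypothetical model, loss_fn, optimizer, and train_loader:

```python
accum_steps = 4  # effective batch size = batch_size * accum_steps

optimizer.zero_grad(set_to_none=True)
for step, (x, y) in enumerate(train_loader):
    loss = loss_fn(model(x), y)
    # Divide so the accumulated gradient is an average, not a sum.
    (loss / accum_steps).backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
```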

Tested changes and their effect on speed

Baseline: 35 sec, which means it would take 36 hours to train one of the models, and almost a week to train 4 models.

-> torch.backends.cudnn.benchmark = True: seems unchanged.
-> Update to torch 2: slowed down (~10 sec).
-> Add torch.compile (after checking the PyTorch version): minor speedup.
-> Switch to torch.no_grad in the test function: no big change.
-> set_to_none=True in zero_grad: no big change.
-> Accumulate gradients: no big change.
-> Set the default device: did not work well.
-> Check whether both train and test are slow:

train : 19.764665842056274
testing : 9.0661141872406
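For reference, a hedged sketch of how two of the changes in the log above (torch.compile and the no_grad test function) might have been wired in; all names are hypothetical:

```python
import torch

model = torch.compile(model)  # PyTorch 2.x: JIT-compile the forward pass

@torch.no_grad()  # skip autograd bookkeeping during evaluation
def test(model, test_loader, device):
    model.eval()
    correct, total = 0, 0
    for x, y in test_loader:
        x, y = x.to(device), y.to(device)
        preds = model(x).argmax(dim=1)
        correct += (preds == y).sum().item()
        total += y.numel()
    return correct / total
```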

-> Profiling time spent in forward: it's only about 3 seconds out of those 29 seconds. Hm.
-> Writing a custom dataloader that does the transform up front and loads everything into memory.

Uses 75 GB for train + test at startup.
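A hedged sketch of such an in-memory dataset: apply the transform once up front and cache every sample as a tensor (all names hypothetical; assumes the transform returns tensors):

```python
import torch
from torch.utils.data import Dataset

class InMemoryDataset(Dataset):
    """Runs the transform once at construction and keeps everything in RAM,
    so __getitem__ is just an index into pre-built tensors."""

    def __init__(self, base_dataset, transform):
        samples = [(transform(x), y) for x, y in base_dataset]
        self.xs = torch.stack([x for x, _ in samples])
        self.ys = torch.tensor([y for _, y in samples])

    def __len__(self):
        return len(self.ys)

    def __getitem__(self, idx):
        return self.xs[idx], self.ys[idx]
```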

-> It actually seems to slow down the training loop 4x, what.
-> Same is true for the testing loop, weird.

-> Added a profiler to the loop and tried to find the bug. So it looks like what takes the most time is moving data to the device and computing the optimizer step.
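Since move-to-device showed up as the bottleneck, one standard mitigation (not clearly tried in these notes) is pinned host memory plus asynchronous copies:

```python
# Needs pin_memory=True on the DataLoader so batches sit in page-locked
# host memory; non_blocking=True then lets the copy overlap GPU compute.
x = x.to(device, non_blocking=True)
y = y.to(device, non_blocking=True)
```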

-> Huh, halving the batch_size made the move-to-device go from 16 sec to 3 sec.
-> However, inference and backward suddenly went up.
-> Resizing the images down to 32x32 and then increasing the batch_size helped. OpenAI used 128 as the batch size, though.

-> Try adjusting to 16-bit precision? (See the sketch below.)
-> Try running python -m torch.utils.bottleneck.
-> Is it the transforms that take time? https://discuss.pytorch.org/t/how-improve-the-speed-of-training-after-using-transforms-resize/95994 seems relevant.
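A minimal sketch of the 16-bit precision idea using PyTorch automatic mixed precision (names hypothetical):

```python
import torch

scaler = torch.cuda.amp.GradScaler()

for x, y in train_loader:
    x, y = x.to(device), y.to(device)
    optimizer.zero_grad(set_to_none=True)
    # Run the forward pass and loss in float16 where it is safe to do so.
    with torch.cuda.amp.autocast():
        loss = loss_fn(model(x), y)
    scaler.scale(loss).backward()  # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)         # unscales grads, then optimizer.step()
    scaler.update()
```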