Add DistributedHistory for multi-gpu training #955
Conversation
Description

(This is more or less copied from the docs.)

When training a net in a distributed setting, e.g. when using `torch.nn.parallel.DistributedDataParallel`, directly or indirectly with the help of `AccelerateMixin`, the default history class should not be used. This is because each process will have its own history instance with no syncing happening between processes. Therefore, the information in the histories can diverge. When steering the training process through the histories, the resulting differences can cause trouble. When using early stopping, for instance, one process could receive the signal to stop but not the other.

`DistributedHistory` will take care of syncing the distributed batch information across processes, which prevents the issue just described. This class needs to be initialized with a distributed store provided by PyTorch (https://pytorch.org/docs/stable/distributed.html#distributed-key-value-store). I have only tested `torch.distributed.TCPStore` so far, but `torch.distributed.FileStore` should also work. The `DistributedHistory` also needs to be initialized with its rank and the world size (number of processes) so that it has all the required information to perform the syncing. When using `accelerate`, that information can be retrieved from the `Accelerator` instance.

Comments

- Even though the batch information, which is split across processes, is synced, the epoch information, which is not split, is *not* synced. E.g. the recorded duration can differ between processes. It is not quite clear what the "correct" behavior should be here; it would probably depend on what is done based on this information.
- To make it possible to use the new class, I had to change the net initialization code to not reinitialize the history when it is not `None` -- instead the history is just cleared. Otherwise, calling `fit` would always overwrite the `DistributedHistory` with a normal `History` object.
- Unfortunately, `TCPStore` cannot be pickled, therefore I set it to `None` when pickling. This is not tragic as long as users pickle the final model and only load it for inference. If they want to keep on training, they would need to set `net.history.store` manually.
- I added some `time.sleep` calls in the multiprocessing tests to avoid "broken pipe" errors etc.

Update: CI had segfaults for PyTorch 1.11, so I'm skipping the tests with `DistributedHistory` for that PyTorch version. Maybe the tests are just flaky, but we don't want that.
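To make the wiring concrete, here is a rough sketch based on the description above. The module, data, host, and port are made-up placeholders, and the exact parameter plumbing (e.g. passing `history=` to the net) follows the description rather than a verified API; the script would be launched with `accelerate launch`:

```python
import numpy as np
import torch
from torch import nn
from torch.distributed import TCPStore
from accelerate import Accelerator
from skorch import NeuralNetClassifier
from skorch.hf import AccelerateMixin
from skorch.history import DistributedHistory


class MyModule(nn.Module):
    """Placeholder classifier module; any PyTorch module would do."""
    def __init__(self, num_units=10):
        super().__init__()
        self.dense = nn.Linear(20, num_units)
        self.output = nn.Linear(num_units, 2)

    def forward(self, X):
        return self.output(torch.relu(self.dense(X)))


class AcceleratedNet(AccelerateMixin, NeuralNetClassifier):
    """NeuralNetClassifier that uses accelerate for (multi-)GPU training."""


accelerator = Accelerator()
is_master = accelerator.is_main_process    # one process hosts the TCP store
world_size = accelerator.num_processes     # total number of processes
rank = accelerator.local_process_index     # rank of this particular process

# Key-value store used to sync the batch information between processes;
# FileStore should work as well, but only TCPStore has been tested.
store = TCPStore("127.0.0.1", 8080, world_size=world_size, is_master=is_master)
dist_history = DistributedHistory(store=store, rank=rank, world_size=world_size)

net = AcceleratedNet(
    MyModule,
    criterion=nn.CrossEntropyLoss,
    accelerator=accelerator,
    history=dist_history,
)

# toy data just to make the sketch self-contained
X = np.random.randn(128, 20).astype(np.float32)
y = np.random.randint(0, 2, size=128).astype(np.int64)
net.fit(X, y)
```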
Got a segfault on CI with PyTorch v1.11 and Python=3.10.
For the history of this feature, see this comment and associated thread. Ping @Raphaaal.
`accelerator.process_index` seems to be something else here.
Looks clean and well thought out. LGTM
skorch/history.py
Outdated
```python
state = self.__dict__.copy()
try:
    pickle.dumps(state['store'])
except TypeError:
```
it would be beneficial to know why this is expected
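For context, the guard under discussion is presumably along these lines. This is only a sketch of the method, reconstructed from the excerpt above and the PR description (the store is dropped when it cannot be pickled), not the merged code:

```python
import pickle

def __getstate__(self):
    # Method of the history class (sketch). Called when the history is
    # pickled; the distributed store itself may not be picklable (e.g.
    # TCPStore), so probe it and drop it from the state if needed.
    state = self.__dict__.copy()
    try:
        pickle.dumps(state['store'])
    except (TypeError, pickle.PicklingError):
        # The user can set net.history.store again manually after unpickling.
        state['store'] = None
    return state
```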
skorch/history.py
Outdated
```
:class:`torch.distributed.TCPStore` has been tested to work.

rank : int
    The rank of this process.
```
As a single description of the parameter this is quite tautological. Maybe it is worthwhile to remark on the properties and the possible origin of the rank? E.g. "Number differentiating the distributed training processes, e.g. as provided by `accelerate.local_process_index`."?
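If that suggestion were taken up, the parameter description might read roughly as follows. The wording is the reviewer's suggestion quoted above, not necessarily the final docstring:

```python
# Hypothetical revision of the ``rank`` parameter description (numpydoc style),
# using the reviewer's suggested wording:
RANK_DOC = """
rank : int
    Number differentiating the distributed training processes, e.g. as
    provided by ``accelerate.local_process_index``.
"""
```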
- Better docstring
- More code comments

Also:

- Catch PicklingError in case user provides custom Store that can't be pickled.
@githubnemo Very good points, I have extended the docstrings and added more context to the comment.

Oh and btw., coverage claims that these lines are not covered, which is not true (adding an error there will raise). I suspect that coverage is not correctly measured because of the use of multiprocessing.
Failing CI seems to be unrelated.
DistributedHistory added in #955.
Preparation for release of version 0.13.0

Release text:

The new skorch release is here and it has some changes that will be exciting for some users.

- First of all, you may have heard of the [PyTorch 2.0 release](https://pytorch.org/get-started/pytorch-2.0/), which includes the option to compile the PyTorch module for better runtime performance. This skorch release allows you to pass `compile=True` when initializing the net to enable compilation.
- Support for training on multiple GPUs with the help of the [`accelerate`](https://huggingface.co/docs/accelerate/index) package has been improved by fixing some bugs and providing a dedicated [history class](https://skorch.readthedocs.io/en/latest/user/history.html#distributed-history). Our documentation contains more information on [what to consider when training on multiple GPUs](https://skorch.readthedocs.io/en/latest/user/huggingface.html#caution-when-using-a-multi-gpu-setup).
- If you have ever been frustrated with your neural net not training properly, you know how hard it can be to discover the underlying issue. Using the new [`SkorchDoctor`](https://skorch.readthedocs.io/en/latest/helper.html#skorch.helper.SkorchDoctor) class will simplify the diagnosis of underlying issues. Take a look at the accompanying [notebook](https://nbviewer.org/github/skorch-dev/skorch/blob/master/notebooks/Skorch_Doctor.ipynb).

Apart from that, a few bugs have been fixed and the included notebooks have been updated to properly install requirements on Google Colab.

We are grateful for external contributors, many thanks to:

- Kshiteej K (kshitij12345)
- Muhammad Abdullah (abdulasiraj)
- Royi (RoyiAvital)
- Sawradip Saha (sawradip)
- y10ab1 (y10ab1)

Find below the list of all changes since v0.12.1:

### Added

- Add support for compiled PyTorch modules using the `torch.compile` function, introduced in the [PyTorch 2.0 release](https://pytorch.org/get-started/pytorch-2.0/), which can greatly improve performance on new GPU architectures; to use it, initialize your net with the `compile=True` argument; further compilation arguments can be specified using the dunder notation, e.g. `compile__dynamic=True`
- Add a class [`DistributedHistory`](https://skorch.readthedocs.io/en/latest/history.html#skorch.history.DistributedHistory) which should be used when training in a multi-GPU setting (#955)
- `SkorchDoctor`: A helper class that assists in understanding and debugging the neural net training, see [this notebook](https://nbviewer.org/github/skorch-dev/skorch/blob/master/notebooks/Skorch_Doctor.ipynb) (#912)
- When using `AccelerateMixin`, it is now possible to prevent unwrapping of the modules by setting `unwrap_after_train=True` (#963)

### Fixed

- Fixed install command to work with recent changes in Google Colab (#928)
- Fixed a couple of bugs related to using non-default modules and criteria (#927)
- Fixed a bug when using `AccelerateMixin` in a multi-GPU setup (#947)
- `_get_param_names` returns a list instead of a generator so that subsequent error messages return useful information instead of a generator `repr` string (#925)
- Fixed a bug that caused modules to not be sufficiently unwrapped at the end of training when using `AccelerateMixin`, which could prevent them from being pickleable (#963)
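The release text above mentions passing `compile=True` and forwarding compilation arguments with the dunder notation. A minimal sketch of how that might look; the module and the toy data here are placeholders, not part of the release notes:

```python
import numpy as np
import torch
from torch import nn
from skorch import NeuralNetClassifier


class MyClassifierModule(nn.Module):
    """Small placeholder module; any PyTorch module would do."""
    def __init__(self, num_units=10):
        super().__init__()
        self.dense = nn.Linear(20, num_units)
        self.output = nn.Linear(num_units, 2)

    def forward(self, X):
        return self.output(torch.relu(self.dense(X)))


# compile=True asks skorch to compile the module via torch.compile (PyTorch >= 2.0);
# additional compilation arguments are passed using the dunder notation.
net = NeuralNetClassifier(
    MyClassifierModule,
    criterion=nn.CrossEntropyLoss,
    max_epochs=5,
    compile=True,
    compile__dynamic=True,
)

# toy data just to make the sketch runnable
X = np.random.randn(100, 20).astype(np.float32)
y = np.random.randint(0, 2, size=100).astype(np.int64)
net.fit(X, y)
```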