
Enhancement/auto device option #600


Merged

Conversation

@BenAjayiObe (Contributor) commented Feb 27, 2020

Context

We would like to modify the to_device function to add the option for users to choose "auto" or None. By selecting "auto", the user allows skorch to automatically detect whether a GPU is available. This means the user is no longer burdened with writing repetitive boilerplate code like the following:

device = "cuda" if torch.cuda.is_available() else "cpu"

By selecting None, we leave all tensors and modules unmodified, so that a user can still customise how they allocate the devices themselves without us interfering.
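A minimal sketch of the proposed semantics (illustrative only; the actual to_device in skorch/utils.py also recurses into containers such as tuples):

import torch

def to_device(X, device):
    # device=None: leave X untouched so the user stays in control of allocation.
    if device is None:
        return X
    # device="auto": use cuda when available, otherwise fall back to cpu.
    if device == "auto":
        device = "cuda" if torch.cuda.is_available() else "cpu"
    return X.to(device)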

Changes

  • CHANGES.md: Added a note about the changes.
  • examples/benchmarks/mnist.py: Updated the MNIST example to use the to_device function.
  • examples/word_language_model/data.py: Updated the language model example to use the to_device function.
  • skorch/net.py: Updated the device parameter description and added to_device where needed.
  • skorch/tests/test_utils.py: Added tests for the new "auto" and None options in to_device.
  • skorch/utils.py: Updated the to_device function to automatically find the fastest device or leave the input unmodified, depending on the argument passed.

Tests

  • Modified existing tests in the TestToDevice class to update the expected value depending on the argument passed to to_device.
  • Added a new test for when a PackedSequence object is passed to to_device, as this object needs special handling: it is a subclass of a namedtuple, but if to_device operated on a PackedSequence the same way it does on a regular tuple, execution would break because some of its attributes aren't tensors (see the sketch below).
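A minimal sketch of that special case, assuming a to_device that otherwise recurses over tuple elements (the structure is illustrative, not the exact skorch code):

from torch.nn.utils.rnn import PackedSequence

def to_device(X, device):
    # PackedSequence must be dispatched before the generic tuple branch:
    # fields such as sorted_indices may be None, so mapping .to(device)
    # over the fields like a plain tuple would fail; its own .to() is safe.
    if isinstance(X, PackedSequence):
        return X.to(device)
    if isinstance(X, tuple):
        return tuple(to_device(x, device) for x in X)
    return X.to(device)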

Pulling Latest Skorch Code
@BenjaminBossan (Collaborator) left a comment

Thanks for this addition; you did a very good job of finding all the instances that need to be changed and of extending the tests. We're almost good to go.

I have a few minor comments that you could address. Apart from those, could you please update the CHANGES.md to include this new addition?

@BenAjayiObe (Contributor, Author) commented Mar 2, 2020

Hi @BenjaminBossan, thank you very much for your comments. I'll action those asap. I just wanted to apologise, as I accidentally submitted this PR prematurely, hence the empty PR description. In addition to actioning your comments, I also have to implement the None feature and its tests, to prevent forced device casting.

This PR should be updated shortly.

@BenAjayiObe BenAjayiObe force-pushed the enhancement/auto-device-option branch from 7f819a9 to 268c22b on March 2, 2020
BenAjayiObe pushed a commit to BenAjayiObe/skorch that referenced this pull request on March 2, 2020:
    *  use the fastest device available
    *  leave the device type unmodified respectively
(skorch-dev#600)
@BenAjayiObe BenAjayiObe force-pushed the enhancement/auto-device-option branch from 268c22b to 64f8a1a, then to 3755ed4, and finally to a1fc5ea on March 2, 2020
@BenjaminBossan (Collaborator)

@BenAjayiObe thanks for adding the device=None option as well, that's great! I have a few minor comments; apart from those, this looks good.

Since it's a more fundamental change, I would like @ottonemo or @thomasjpfan to also have a look at this before merging.

Small comment: it's nice that you rebased your existing commit. However, it makes it harder for me to review the changes, because it's not obvious what has changed since my last review and what has stayed the same. Could you please, if you add more changes, just merge them instead? Then, before merging the whole PR into master, we will squash it into one commit.

@ottonemo (Member) left a comment

Thanks for the PR, this is looking pretty mature already.

I need to address two concerns though. Firstly:

>>> torch.device('auto')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: Expected one of cpu, cuda, mkldnn, opengl, opencl, ideep, hip, msnpu device type at start of device string: auto
  1. Obviously, we need to make sure that torch.device(self.device) doesn't appear anywhere in the code (it still does in some places, for example when loading parameters).
  2. (More important) There are many more options than just 'cpu' and 'cuda', so 'auto' is misleading, or at least not well defined. On an AMD platform I would expect 'auto' to select 'hip' or 'opencl' instead of 'cuda', for example.

Secondly, load_params won't work with this change when setting device='auto', since it relies on self.device to point to the target device. Of course, it is possible to check which device is requested in load_params as well, but maybe it would be smarter to do this once during initialization and rely on self.device afterwards.

What are your opinions?

@BenAjayiObe (Contributor, Author) commented Mar 9, 2020

@BenjaminBossan Thanks for your comments. Ah, apologies for obscuring my changes with the rebase. I'll preserve the commit history, as you suggest, in the future. 👍

@BenAjayiObe (Contributor, Author) commented Mar 10, 2020

@ottonemo, thanks for your comments.

On your first point

Obviously, we need to make sure that torch.device(self.device) doesn't appear anywhere in the code

This is a great point, thanks for flagging. I will update the code to ensure that it isn't used anywhere.

there are many more options than just 'cpu' and 'cuda'

I feel that it would be a complex piece of work to programmatically account for which device to use, under which circumstances, for all the devices that PyTorch supports. To me, the original spirit of the issue was to remove the burden of the most common case: checking whether cuda is available.

Perhaps, instead of allowing an "auto" option for self.device, we could specify that if the user enters "cuda", we automatically fall back to "cpu" when "cuda" isn't available. In such circumstances, we could output a warning informing the user that cuda wasn't available and has been replaced by "cpu".
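A rough sketch of this proposal (the helper name resolve_device is purely illustrative):

import warnings

import torch

def resolve_device(device):
    # Fall back from "cuda" to "cpu" when CUDA is unavailable, but warn.
    if device == "cuda" and not torch.cuda.is_available():
        warnings.warn("CUDA is not available; falling back to 'cpu'.")
        return "cpu"
    return device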

I believe this would resolve your second concern as well. Let me know what you think.

@BenjaminBossan (Collaborator)

Perhaps, instead of allowing an "auto" option for self.device, we could specify that if the user enters "cuda", we automatically fall back to "cpu" when "cuda" isn't available. In such circumstances, we could output a warning informing the user that cuda wasn't available and has been replaced by "cpu".

I wouldn't go that way. Even with a warning, this can easily be missed. I prefer a hard error to silently switching the device.

We're not changing the default behavior for anyone, so we should keep this in mind: do we actually make it worse for anyone by introducing this feature? If not, we should probably proceed. We just shouldn't dig ourselves into a hole that we can't (easily) get out of later.

@ottonemo (Member)

there are many more options than just 'cpu' and 'cuda'

I feel that it would be a complex piece of work to programmatically account for which device to use, under which circumstances, for all the devices that PyTorch supports. To me, the original spirit of the issue was to remove the burden of the most common case: checking whether cuda is available.

Perhaps, instead of allowing an "auto" option for self.device, we could specify that if the user enters "cuda", we automatically fall back to "cpu" when "cuda" isn't available. In such circumstances, we could output a warning informing the user that cuda wasn't available and has been replaced by "cpu".

I believe this would resolve your second concern as well. Let me know what you think.

I'm torn on this. In this particular circumstance, I think it would generally be OK to do it the automagic way, since the user can always assert torch.cuda.is_available() beforehand if they want to make sure that CUDA will be used. But I also concur with @BenjaminBossan that turning what used to fail hard into a soft warning is potentially dangerous (even if this is 'just' about training time, naively speaking).

Can we do something like this? (roughly sketching):

class NeuralNet:
    def __init__(self, ..., device='auto_cuda'):
        self.device = get_device(device)

while keeping the self.device handling as-is? I would expect get_device to return 'cuda' for the input 'auto_cuda', with 'cpu' as the fallback, while 'cuda' and all other device variants would just be passed through. This would enable us to build similar fallbacks for other devices in the future, and self.device would point to the actual device the computation is done on.
What do you think?
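A minimal runnable version of the sketch (get_device is hypothetical at this point, not an existing skorch function):

import torch

def get_device(device):
    # 'auto_cuda' resolves to 'cuda' when available, with 'cpu' as the
    # fallback; every other device string is passed through unchanged.
    if device == 'auto_cuda':
        return 'cuda' if torch.cuda.is_available() else 'cpu'
    return device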

@thomasjpfan (Member)

I am -1 on doing this magically without a parameter in NeuralNet. The suggestion by @ottonemo looks good.

@BenAjayiObe (Contributor, Author) commented Mar 12, 2020

@BenjaminBossan , @ottonemo, @thomasjpfan , thank you for your thoughts

What I'm taking away from this is that a user must always initiate the auto feature through deliberate action. I can see the sense in this argument.

Building on your suggestion, @ottonemo: if the intention is to grow skorch's coverage of PyTorch's other devices over time, should we ensure now that the way users interact with the auto feature reflects this?

So, for example, I could imagine a new boolean parameter for NeuralNet(), called something like auto_find_device, that defaults to False. Setting it to True would then ensure that we call your get_device for the devices we currently support. This way, when we extend the list of devices we can auto-find, we just have to add the code to get_device and update the docstring.

Let me know what you think.

@BenjaminBossan (Collaborator)

Can we do something like this? (roughly sketching):

class NeuralNet:
    def __init__(self, ..., device='auto_cuda'):
        self.device = get_device(device)

It won't be quite as easy since we should set the init parameters the same as they're passed (for sklearn compatibility). Therefore, we either

  1. need to introduce device_ for the actual device (with an accompanying initialize_device?) or
  2. provide a property for device that returns get_device(self._device).

The disadvantage of 1. is that many people may already rely on self.device in their custom code, and hence their code could suddenly be subtly wrong.

Option 2. could also have some subtle consequences that I'm not quite sure of (maybe problems with clone or pickle incompatibility).
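For concreteness, a rough sketch of option 2, reusing the hypothetical get_device from above (this is not what was merged):

class NeuralNet:
    # The raw user-supplied value is stored in _device; reading .device
    # resolves it on the fly. As noted above, this may interact badly
    # with clone or pickling.
    def __init__(self, device='auto_cuda'):
        self.device = device  # goes through the setter below

    @property
    def device(self):
        return get_device(self._device)

    @device.setter
    def device(self, value):
        self._device = value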

How about a bit of a different direction: Could we provide a helper function auto_device that does this automatic handling? It would look like this:

net = NeuralNet(..., device=auto_device())  # use cuda if available, else cpu

That could be extended to support new devices.
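A minimal sketch of such a helper (auto_device is hypothetical and only covers the cuda/cpu case discussed in this thread):

import torch

def auto_device():
    # Resolve the device once, at call time; NeuralNet then receives a
    # plain device string, so its init parameters stay sklearn-friendly.
    return 'cuda' if torch.cuda.is_available() else 'cpu'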

So, for example, I could imagine a new boolean parameter for NeuralNet(), called something like auto_find_device

Tbh, this sounds like too much complexity for what should be a simple feature.

It's too bad that this feature turns out to be more nuanced than initially thought. However, the None option should be a good addition without drawbacks.

@BenAjayiObe (Contributor, Author) commented Mar 21, 2020

@BenjaminBossan

It won't be quite as easy since we should set the init parameters the same as they're passed (for sklearn compatibility).

That's a shame. It sounds like it would be sensible to avoid this route then.

Tbh, this sounds like too much complexity for what should be a simple feature.

That's fair

Thanks for your suggestion. I think it would offer a good way to completely decouple the auto device logic from the rest of the code, so future work would be fairly straightforward.

My one concern is whether this route is any less work than just calling cuda.is_available() for the user. For just the cuda vs cpu case, perhaps not. However, as the feature starts to include more device options, I think it's justified.

Another reason I like this is that, at the moment, I'm not sure how I would go about discovering whether some of the other PyTorch-compatible devices are available to me, and I'm sure this is the case for other developers as well. I could see people using skorch for this function alone if it means avoiding googling "how to detect device X on my machine" and going down that rabbit hole.

@BenjaminBossan (Collaborator)

My one concern is whether this route is any less work than just calling cuda.is_available() for the user. For just the cuda vs cpu case, perhaps not. However, as the feature starts to include more device options, I think it's justified.

So maybe we could postpone this functionality and pick it up again once more devices are supported?

As mentioned above, I still think the None option could be useful for some users, and we could go ahead with it. I also think it makes sense that we don't directly cast to a device but instead always use to_device. Thus, to make progress, I would suggest either 1) removing the 'auto' feature from this PR for now, or 2) creating a separate PR with only the mentioned changes.

@BenAjayiObe (Contributor, Author)

So maybe we could postpone this functionality and pick it up again once more devices are supported?

This sounds good to me.

Okay, I'll remove the auto functionality, as you say. It's a shame, but I think it makes sense.

@BenAjayiObe (Contributor, Author)

One point that I would like to highlight is that I'm currently defaulting to the fallback_device when skorch.net.NeuralNet#load_params is called and the user has passed device=None. As the fallback_device is 'cpu' by default and isn't set when called by load_params, I think this keeps it in line with the device default across the rest of the repo.
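A hedged sketch of that rule (the helper name and signature are illustrative):

import torch

def get_map_location(target_device, fallback_device='cpu'):
    # With device=None there is no stored target to restore parameters
    # to, so use the fallback, matching the 'cpu' default elsewhere.
    if target_device is None:
        target_device = fallback_device
    return torch.device(target_device)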

Let me know what you think.

@BenjaminBossan (Collaborator)

Thanks for the changes, this looks reasonable to me. @ottonemo do you have anything to add?

Another nice thing about this PR is the consistent use of to_device within the code base, which I like.

@ottonemo (Member) commented Apr 6, 2020

Thanks @BenAjayiObe for the work you have put into this. I'm sure we will revisit the auto functionality when there is more substance to it than just convenience. Maybe we can combine this with support for XLA.

@BenjaminBossan (Collaborator) left a comment

Good work, thanks a lot

@BenjaminBossan BenjaminBossan merged commit 2adee59 into skorch-dev:master Apr 7, 2020
BenjaminBossan added a commit that referenced this pull request Jul 28, 2022
In #600, we introduced the option to set device=None, which means that
skorch should not move any data to any device. However, this setting
introduced a bug when trying to load the net, as that code didn't
explicitly deal with device=None. With this PR, the bug is fixed.

Implementation

If device=None, we really have no way of knowing what device to map the
parameters to. The most reasonable thing to do is to use the fallback,
which is 'cpu'.