A callback for modifying the loss before the optimizer obtains it. #295


Open
spott opened this issue Jul 23, 2018 · 18 comments
@spott
Contributor

spott commented Jul 23, 2018

There is a class of things that don't appear to be possible within the current callback framework of skorch.

The first one that comes to mind is adversarial training. To do this efficiently, you would want to run all samples through the network once, create new samples by adding gradient-dependent noise to the old samples, then run those through again:

Xi.requires_grad = True
output = net(Xi)
loss = criterion(output, yi)
loss.backward()

# FGSM-style perturbation: step the input along the sign of its gradient
Xi_at = Xi + epsilon * Xi.grad.sign()

at_output = net(Xi_at)
loss_at = criterion(at_output, yi)
loss_at.backward()

optimizer.step()

Unfortunately, this isn't actually possible using the current framework (without creating a new NeuralNet class).

This could probably be fixed if Xi and yi were passed to the on_grad_computed callback, but they aren't currently.
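As a rough sketch, such a callback could look like this. The class name, the epsilon default, and the X/y arguments are all hypothetical (the X/y arguments are exactly the change proposed in this issue); a real implementation would subclass skorch.callbacks.Callback:

```python
import torch

class FGSMAugment:
    """Hypothetical sketch (not skorch API): assumes the batch (X, y) is
    passed to on_grad_computed, as proposed in this issue. A real callback
    would subclass skorch.callbacks.Callback."""

    def __init__(self, epsilon=0.1):
        self.epsilon = epsilon

    def on_grad_computed(self, net, named_parameters, X=None, y=None, **kwargs):
        # FGSM: perturb the input along the sign of its gradient
        if X is not None and X.grad is not None:
            self.X_adv = (X + self.epsilon * X.grad.sign()).detach()
```

The remaining difficulty, discussed further below, is that the callback would still need a way to run the adversarial batch back through the net.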

@benjamin-work
Contributor

In general, I see no obstacle to passing Xi and yi to on_grad_computed. If you'd like, you can open a PR for this.

For your particular problem, however, I believe that overriding train_step_single (if you work on the current master branch) to look similar to what you proposed could be the better approach. The reason is that I would normally not expect another training step to be made within a callback.

One of our goals with skorch was to make it easy to subclass NeuralNet and tinker with its methods. There are too many NN architectures to be able to cover them all with one class. We do that all the time for our own projects. If you find subclassing difficult, tell us where we could improve further.

@ottonemo
Member

The question is if things like adversarial training can be modularised and reused for different architectures. It would be nice to have a pluggable virtual adversarial training callback but I'm not sure if this is feasible.

In general I'm with Benjamin on this, you should resort to using the object framework of python to make such fundamental changes to a net.

@spott
Contributor Author

spott commented Jul 24, 2018

In general I'm with Benjamin on this, you should resort to using the object framework of python to make such fundamental changes to a net.

I feel like this kind of thing -- data augmentation or regularization -- shouldn't really need a whole new NeuralNet class to make work. That just makes it harder to play around with adding and removing these kinds of regularizers... ideally you should be able to do a grid search with and without adversarial training!

If you'd like, you can open a PR for this.

Will do.

@benjamin-work
Contributor

I feel like this kind of thing -- data augmentation or regularization -- shouldn't really need a whole new NeuralNet class to make work

There are many ways to augment data or apply regularization. Therefore, there will never be one class that can master them all.

We have taken care of many things. For example, weight decay/L1/L2 regularization is already handled quite well. Training-time feature augmentation can often be handled by the DataLoader (e.g. image augmentation). Data preprocessing is handled well by sklearn Pipelines.
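For instance, weight decay needs no custom code at all, since it is a standard argument of torch optimizers (a minimal sketch with a stand-in Linear module):

```python
import torch
from torch import nn

# L2 regularization via the optimizer's built-in weight_decay argument;
# with skorch this can be set as NeuralNet(..., optimizer__weight_decay=1e-4).
module = nn.Linear(4, 1)
optimizer = torch.optim.SGD(module.parameters(), lr=0.1, weight_decay=1e-4)
```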

Your particular example is different because it requires the gradient for augmentation. Additionally, GANs typically also require overriding methods in skorch. But there too, there are so many different implementations that it's hard to cover them all. I could, however, imagine that a GanNeuralNet could be useful.

ideally you should be able to do a grid search with and without adversarial training!

This is already possible without too much hassle. For your example above, you need to introduce a new argument on NeuralNet, then you can adjust it with grid search:

class AdversarialNet(NeuralNet):
    def __init__(self, *args, use_adversarial_training=False, **kwargs):
        super().__init__(*args, **kwargs)
        self.use_adversarial_training = use_adversarial_training

    def train_step_single(...):
        ...
        if self.use_adversarial_training:
            ...

param_grid = {'use_adversarial_training': [True, False], ...}
search = GridSearchCV(AdversarialNet(...), param_grid)

@spott
Contributor Author

spott commented Jul 25, 2018

There are many ways to augment data or apply regularization. Therefore, there will never be one class that can master them all.

Agreed... but there should be one class that can do most of it.

I use skorch primarily for making my work reproducible and recordable. Without skorch, saving a model so that it will retrain identically requires saving a whole bunch of things: the NN class itself, the instantiation code for the NN, the optimizer, and the criterion, the training-loop code, the DataLoader parameters, the datasets, the random seeds, and probably a few others that I'm forgetting right now. The benefit that skorch gives me is that I only need to save the instantiation of the NeuralNet class (and the dataset). That one short block of code gives me all the information I need to reproduce a model run.

I'm not opposed to a different NeuralNet class (I will likely need to create one soon for a different idea), but it adds more code that I need to keep track of. If I need to modify the NeuralNet class whenever I want to add something to the net, then I need to version and keep track of that class vs. the callbacks which are simple and atomic enough that they don't need to change once I have made them "bug free".

In this particular case, I think that adding the batch data to the on_grad_computed callback is enough to do what I want to do, and is enough for a whole class of training loop modifications.

I could, however, imagine that a GanNeuralNet could be useful.

I could as well, and a GAN seems to be a very valid reason to create a new NeuralNet class. I just think that creating a new NeuralNet class should be avoided when it is possible to do so.

@benjamin-work
Contributor

I believe we largely agree on what should and what shouldn't be done. The only missing piece in the puzzle is what use cases are general enough to require a built-in solution. Unfortunately, this kind of data is hard to come by.

In this particular case, I think that adding the batch data to the on_grad_computed callback is enough to do what I want to do, and is enough for a whole class of training loop modifications.

Do you want to take this?

@spott
Contributor Author

spott commented Jul 25, 2018

Yea, I will. I'll try and get that done by the end of the day.

@spott
Contributor Author

spott commented Jul 25, 2018

Is there a set of tests somewhere that I should run?

@taketwo
Contributor

taketwo commented Jul 25, 2018

If you want to help developing, run:

git clone https://github.com/dnouri/skorch.git
cd skorch
# create and activate a virtual environment
pip install -r requirements.txt
# install pytorch version for your system (see below)
pip install -r requirements-dev.txt
python setup.py develop

py.test  # unit tests
pylint skorch  # static code checks

(this comes from the README)

@spott
Contributor Author

spott commented Jul 27, 2018

Oops.

@taketwo: Thanks for that, I should have read that before asking.

@benjamin-work
Contributor

@spott Did #297 solve your issue?

@spott
Contributor Author

spott commented Aug 16, 2018

It mostly did. However, there isn't a good way to run a training step from inside a callback (especially the on_grad_computed callback, where running another training step will cause another call to on_grad_computed), so creating an adversarial example callback is a little hacky.

But maybe those should be in a different issue.

@zachbellay

@benjamin-work In your AdversarialNet example, I'm confused as to where to pass in the discriminator when using a GAN architecture. My understanding is that you would pass in the generator you want to train as the module to AdversarialNet, and then you would override train_step_single to get the loss from the generator/discriminator adversarial training. However, this leaves out training the discriminator. Could you perhaps provide a more detailed example of AdversarialNet that includes the generator and discriminator and the code required to train them?

@BenjaminBossan
Collaborator

@zachbellay Without knowing anything about your specific case, would it be possible to have the discriminator and the generator be submodules of the same overarching module? Then the module could have those two components as attributes that you can use depending on which of them needs to be trained.
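A minimal sketch of such an overarching module (the class name is hypothetical, with stand-in Linear layers in place of the real networks):

```python
import torch
from torch import nn

class GANContainer(nn.Module):
    """Hypothetical container: generator and discriminator are attributes of
    one module, so skorch sees a single module_."""
    def __init__(self):
        super().__init__()
        self.generator = nn.Linear(8, 4)       # stand-in for the real generator
        self.discriminator = nn.Linear(4, 1)   # stand-in for the real discriminator

    def forward(self, z):
        # default forward pass: generator followed by discriminator
        return self.discriminator(self.generator(z))
```

A custom train_step could then reach either component via self.module_.generator or self.module_.discriminator, depending on which one is being trained.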

In general, we should try to provide a template for GANs in skorch, but since I never use them personally, it's hard for me to do that. Maybe if there's a good pointer to existing pytorch code that could be ported to skorch, we could work on that.

@zachbellay

@BenjaminBossan It would be possible to have them in the same overarching module, although I'm less certain of how to appropriately integrate the two into the single Skorch module.

My use case is basically a beefed up version of DCGAN. Here is a good Pytorch implementation that is very similar to my use case. Thanks again for your help!

@BenjaminBossan
Collaborator

Thank you @zachbellay, I'll have a look at this as soon as I've got some time on my hands and see if it's possible to port to skorch without too much gymnastics.

@YannDubs
Contributor

@benjamin-work that would be very useful for me also. I had some GAN-type training to do, and always used some tricks to make it work in skorch. Essentially the problem comes from the fact that for GANs you need 2 models, 2 losses, 2 optimizers, and 2 alternating optimization steps (the minimax game does not converge under joint optimization). Concerning models and losses, this is not an issue, as one can combine them in a single module/loss (although this will not save both losses to the history). Optimization is trickier: skorch very naturally does the following loop: compute output, compute loss, optimization step. Here we need 2x this loop, where the second training step (discriminator) depends on the output of the first. So even if we assume the same optimizer/scheduling for both models (I don't think people do, but allowing different ones would require many more changes, and parameter groups can already do quite a lot), you would still have to either:

  • Proposition 1: put the 2 training loops in a single one (i.e. do output-loss-optim only once)
  • Proposition 2: modify the logic of train_step to sequentially do the generator's output-loss-optim and then the discriminator's output-loss-optim.

Taking from the link posted by @zachbellay, here's essentially the minimum to achieve:

#  Train Generator
# -----------------
optimizer_G.zero_grad()
gen_imgs = generator(noise())
g_pred_fake = discriminator(gen_imgs)
g_loss = generator_loss(g_pred_fake)
g_loss.backward()
optimizer_G.step()

#  Train Discriminator
# ---------------------
optimizer_D.zero_grad()
d_pred_real = discriminator(real_imgs)
d_pred_fake = discriminator(gen_imgs.detach())
d_loss = discriminator_loss(d_pred_real, d_pred_fake)
d_loss.backward()
optimizer_D.step()

Here are the proposition outlines (not working code)

Proposition 1:

# Output
# -------
gen_imgs = generator(noise())
set_requires_grad(discriminator, False)  # the following should not backprop through discriminator
g_pred_fake = discriminator(gen_imgs)
set_requires_grad(discriminator, True) # discriminator back on
d_pred_real = discriminator(real_imgs)
d_pred_fake = discriminator(gen_imgs.detach()) # don't backprop through generator

# Loss
# -------
loss = generator_loss(g_pred_fake) + discriminator_loss(d_pred_real, d_pred_fake)

# Optimize
# -------
loss.backward()
optimizer.step()

This is very easy to do in skorch. Please double-check that the gradients are actually correct (what is backpropagated where), but I think they are. The major issue here is flexibility. Indeed, the theory says (if I'm not mistaken) to 1) do multiple optimization steps of the discriminator, and 2) update the discriminator using the latest generator. Neither is done in the given link (I'm not sure about SOTA GANs), but both would be basically impossible in a single output-loss-optim loop.
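The set_requires_grad helper used in the sketch above is not part of torch or skorch; a minimal version could look like:

```python
from torch import nn

def set_requires_grad(module: nn.Module, flag: bool) -> None:
    # toggle gradient tracking for every parameter of the module,
    # so backprop through the frozen component is skipped
    for p in module.parameters():
        p.requires_grad_(flag)
```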

Proposition 2:

def train_step(self, Xi, yi, **fit_params):
    step_accumulator = self.get_train_step_accumulator()
    def step_fn():
        step = self.train_step_single(Xi, yi, **fit_params)
        step_accumulator.store_step(step) 
        return step['loss']
    
    self.module_.mode = "generator"
    self.criterion_.mode = "generator"
    self.optimizer_.step(step_fn)
    self.optimizer_.zero_grad()
    gen_pred = step_accumulator.get('y_pred')
    Xi = gen_pred  # use generated samples as input to the discriminator

    self.module_.mode = "discriminator"
    self.criterion_.mode = "discriminator"
    for _ in range(k_discriminator_steps):
         self.optimizer_.step(step_fn)
         self.optimizer_.zero_grad()

    return step_accumulator.get_step()

Here the method is theoretically sound and flexible, but not as clean (it basically uses flags to say whether the loss or model should be in discriminator or generator mode). Note that step_accumulator also has to be modified.
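As a sketch of what such a flag-switched criterion could look like (class name and mode attribute are hypothetical, following the Proposition 2 outline above):

```python
import torch
from torch import nn

class GANCriterion(nn.Module):
    """Hypothetical flag-switched loss: the same criterion object computes the
    generator or discriminator loss depending on its mode attribute."""
    def __init__(self):
        super().__init__()
        self.mode = "generator"
        self.bce = nn.BCEWithLogitsLoss()

    def forward(self, pred, target):
        if self.mode == "generator":
            # the generator wants its fakes classified as real (all ones)
            return self.bce(pred, torch.ones_like(pred))
        return self.bce(pred, target)
```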

Let me know what you think and @zachbellay if that would work for GANs as I don't have any experience with those.

@BenjaminBossan
Collaborator

@YannDubs Thank you for the proposal and clean explanation.

Essentially the problem comes from the fact that for GANs you need 2 models, 2 losses, 2 optimizers, and 2 alternating optimization steps (the minimax game does not converge under joint optimization). Concerning models and losses, this is not an issue, as one can combine them in a single module/loss (although this will not save both losses to the history). Optimization is trickier: skorch very naturally does the following loop: compute output, compute loss, optimization step. Here we need 2x this loop, where the second training step (discriminator) depends on the output of the first.

And I wouldn't be surprised if there were applications where even that is not enough.

At the end of the day, I wonder how much sense it makes to "contort" skorch to make it work this way. On the one hand, I'm flattered that people try to use it for even more unconventional cases, on the other it might just not be the best tool (at the moment).

Basically, at the moment, the user would need to implement their own train_step, as you suggest. Of course they can already do that, but it's too easy right now to do it wrong (e.g. forgetting to notify on_grad_computed or not returning the right values in step).

The main challenges I foresee are the logging/callbacks and the get_params/set_params part. The latter could probably be fixed with some improved tooling and documentation. E.g., we already have params_for, which could be used to dispatch to two separate optimizers.
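For reference, params_for essentially filters keyword arguments by a double-underscore prefix and strips it; roughly (a simplified sketch, not skorch's actual implementation):

```python
def params_for(prefix, kwargs):
    # collect kwargs under `<prefix>__` and strip the prefix,
    # e.g. optimizer_G__lr=0.1 -> {'lr': 0.1}
    if not prefix.endswith('__'):
        prefix += '__'
    return {key[len(prefix):]: val
            for key, val in kwargs.items()
            if key.startswith(prefix)}
```

This is what would let a user pass optimizer_G__lr and optimizer_D__lr to one net and dispatch them to two separate optimizers.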

As I said in the other thread, when I have time, I'll try my hands again on the topic. Any kind of feedback, hints, existing repos with concrete implementations, are appreciated.
