A callback for modifying the loss before the optimizer obtains it. #295


Open
spott opened this issue Jul 23, 2018 · 18 comments
@spott
Contributor

spott commented Jul 23, 2018

There is a class of things that don't appear to be possible within the current callback framework of skorch.

The first one that comes to mind is adversarial training. To do this efficiently, you would want to run all samples through the network once, create new samples by adding gradient-dependent noise to the old samples, then run those through again:

Xi.requires_grad = True
output = net(Xi)
loss = criterion(output, yi)
loss.backward()

# FGSM-style perturbation: step the input along the sign of its gradient
Xi_at = Xi + epsilon * Xi.grad.sign()

at_output = net(Xi_at)
loss_at = criterion(at_output, yi)
loss_at.backward()

optimizer.step()

Unfortunately, this isn't actually possible using the current framework (without creating a new NeuralNet class).

This could probably be fixed if Xi and yi were passed to the on_grad_computed callback, but they aren't currently.
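As a rough sketch, such a callback could look like this. The class name, the epsilon default, and the X/y arguments are all hypothetical (the X/y arguments are exactly the change proposed in this issue); a real implementation would subclass skorch.callbacks.Callback:

```python
import torch

class FGSMAugment:
    """Hypothetical sketch (not skorch API): assumes the batch (X, y) is
    passed to on_grad_computed, as proposed in this issue. A real callback
    would subclass skorch.callbacks.Callback."""

    def __init__(self, epsilon=0.1):
        self.epsilon = epsilon

    def on_grad_computed(self, net, named_parameters, X=None, y=None, **kwargs):
        # FGSM: perturb the input along the sign of its gradient
        if X is not None and X.grad is not None:
            self.X_adv = (X + self.epsilon * X.grad.sign()).detach()
```

The remaining difficulty, discussed further below, is that the callback would still need a way to run the adversarial batch back through the net.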

@benjamin-work
Contributor

In general, I see no obstacle to passing Xi and yi to on_grad_computed. If you'd like, you can open a PR for this.

For your particular problem, however, I believe that overriding train_step_single (if you work on the current master branch) to look similar to what you proposed could be the better approach. The reason is that I would normally not expect another training step to be made within a callback.

One of our goals with skorch was to make it easy to subclass NeuralNet and tinker with its methods. There are too many NN architectures to be able to cover them all with one class. We do that all the time for our own projects. If you find subclassing difficult, tell us where we could improve further.

@ottonemo
Member

The question is if things like adversarial training can be modularised and reused for different architectures. It would be nice to have a pluggable virtual adversarial training callback but I'm not sure if this is feasible.

In general I'm with Benjamin on this, you should resort to using the object framework of python to make such fundamental changes to a net.

@spott
Contributor Author

spott commented Jul 24, 2018

In general I'm with Benjamin on this, you should resort to using the object framework of python to make such fundamental changes to a net.

I feel like this kind of thing -- data augmentation or regularization -- shouldn't really need a whole new NeuralNet class to make work. That just makes it harder to play around with adding and removing these kinds of regularizers... ideally you should be able to do a grid search with and without adversarial training!

If you'd like, you can open a PR for this.

Will do.

@benjamin-work
Contributor

I feel like this kind of thing -- data augmentation or regularization -- shouldn't really need a whole new NeuralNet class to make work

There are many ways to augment data or apply regularization. Therefore, there will never be one class that can master them all.

We have taken care of many things. For example, weight decay/L1/L2 regularization is already handled quite well. Training-time feature augmentation can often be handled by the DataLoader (e.g. image augmentation). Data preprocessing is handled well by sklearn Pipelines.
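For instance, weight decay needs no custom code at all, since it is a standard argument of torch optimizers (a minimal sketch with a stand-in Linear module):

```python
import torch
from torch import nn

# L2 regularization via the optimizer's built-in weight_decay argument;
# with skorch this can be set as NeuralNet(..., optimizer__weight_decay=1e-4).
module = nn.Linear(4, 1)
optimizer = torch.optim.SGD(module.parameters(), lr=0.1, weight_decay=1e-4)
```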

Your particular example is different because it requires the gradient for augmentation. Additionally, GANs typically also require overriding methods in skorch. But there too, there are so many different implementations that it's hard to cover them all. I could, however, imagine that a GanNeuralNet could be useful.

ideally you should be able to do a grid search with and without adversarial training!

This is already possible without too much hassle. For your example above, you need to introduce a new argument on NeuralNet, then you can adjust it with grid search:

class AdversarialNet(NeuralNet):
    def __init__(self, *args, use_adversarial_training=False, **kwargs):
        super().__init__(*args, **kwargs)
        self.use_adversarial_training = use_adversarial_training

    def train_step_single(...):
        ...
        if self.use_adversarial_training:
            ...

param_grid = {'use_adversarial_training': [True, False], ...}
search = GridSearchCV(AdversarialNet(...), param_grid)

@spott
Contributor Author

spott commented Jul 25, 2018

There are many ways to augment data or apply regularization. Therefore, there will never be one class that can master them all.

Agreed... but there should be one class that can do most of it.

I use skorch primarily for making my work reproducible and recordable. Without skorch, saving a model so that it will retrain identically requires saving a whole bunch of things: the NN class itself, the instantiation code for the NN, the optimizer, and the criterion, the training-loop code, the DataLoader parameters, the datasets, the random seeds, and probably a few others that I'm forgetting right now. The benefit that skorch gives me is that I only need to save the instantiation of the NeuralNet class (and the dataset). That one short block of code gives me all the information I need to reproduce a model run.

I'm not opposed to a different NeuralNet class (I will likely need to create one soon for a different idea), but it adds more code that I need to keep track of. If I need to modify the NeuralNet class whenever I want to add something to the net, then I need to version and keep track of that class vs. the callbacks which are simple and atomic enough that they don't need to change once I have made them "bug free".

In this particular case, I think that adding the batch data to the on_grad_computed callback is enough to do what I want to do, and is enough for a whole class of training loop modifications.

I could, however, imagine that a GanNeuralNet could be useful.

I could as well, and a GAN seems to be a very valid reason to create a new NeuralNet class. I just think that creating a new NeuralNet class should be avoided when it is possible to do so.

@benjamin-work
Contributor

I believe we largely agree on what should and what shouldn't be done. The only missing piece in the puzzle is what use cases are general enough to require a built-in solution. Unfortunately, this kind of data is hard to come by.

In this particular case, I think that adding the batch data to the on_grad_computed callback is enough to do what I want to do, and is enough for a whole class of training loop modifications.

Do you want to take this?

@spott
Contributor Author

spott commented Jul 25, 2018

Yea, I will. I'll try and get that done by the end of the day.

@spott
Contributor Author

spott commented Jul 25, 2018

Is there a set of tests somewhere that I should run?

@taketwo
Contributor

taketwo commented Jul 25, 2018

If you want to help developing, run:

git clone https://github.com/dnouri/skorch.git
cd skorch
# create and activate a virtual environment
pip install -r requirements.txt
# install pytorch version for your system (see below)
pip install -r requirements-dev.txt
python setup.py develop

py.test  # unit tests
pylint skorch  # static code checks

(this comes from the README)

@spott
Contributor Author

spott commented Jul 27, 2018

Oops.

@taketwo: Thanks for that, I should have read that before asking.

@benjamin-work
Contributor

@spott Did #297 solve your issue?

@spott
Contributor Author

spott commented Aug 16, 2018

It mostly did. However, there isn't a good way to run a training step from inside a callback (especially the on_grad_computed callback, where running another training step will cause another call to on_grad_computed), so creating an adversarial example callback is a little hacky.

But maybe those should be in a different issue.

@zachbellay

@benjamin-work In your AdversarialNet example, I'm confused as to where to pass in the discriminator when using a GAN architecture. My understanding is that you would pass in the generator you want to train as the module to AdversarialNet, and then you would override train_step_single to get the loss from the generator/discriminator adversarial training. However, this leaves out training the discriminator. Could you perhaps provide a more detailed example of AdversarialNet that includes the generator and discriminator and the code required to train them?

@BenjaminBossan
Collaborator

@zachbellay Without knowing anything about your specific case, would it be possible to have the discriminator and the generator be submodules of the same overarching module? Then the module could have those two components as attributes that you can use depending on which of them needs to be trained.
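A minimal sketch of such an overarching module (the class name is hypothetical, with stand-in Linear layers in place of the real networks):

```python
import torch
from torch import nn

class GANContainer(nn.Module):
    """Hypothetical container: generator and discriminator are attributes of
    one module, so skorch sees a single module_."""
    def __init__(self):
        super().__init__()
        self.generator = nn.Linear(8, 4)       # stand-in for the real generator
        self.discriminator = nn.Linear(4, 1)   # stand-in for the real discriminator

    def forward(self, z):
        # default forward pass: generator followed by discriminator
        return self.discriminator(self.generator(z))
```

A custom train_step could then reach either component via self.module_.generator or self.module_.discriminator, depending on which one is being trained.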

In general, we should try to provide a template for GANs in skorch, but since I never use them personally, it's hard for me to do that. Maybe if there's a good pointer to existing pytorch code that could be ported to skorch, we could work on that.

@zachbellay

@BenjaminBossan It would be possible to have them in the same overarching module, although I'm less certain of how to appropriately integrate the two into the single Skorch module.

My use case is basically a beefed up version of DCGAN. Here is a good Pytorch implementation that is very similar to my use case. Thanks again for your help!

@BenjaminBossan
Collaborator

Thank you @zachbellay, I'll have a look at this as soon as I've got some time on my hands and see if it's possible to port to skorch without too much gymnastics.

@YannDubs
Contributor

@benjamin-work that would be very useful for me also. I had some GAN-type training to do, and always used some tricks to make it work in skorch. Essentially the problem comes from the fact that for GANs you need 2 models, 2 losses, 2 optimizers, and 2 alternating optimization steps (the minimax game does not converge under joint optimization). Concerning models and losses, this is not an issue, as one can combine them in a single module/loss (although this will not save both losses to the history). Optimization is trickier: skorch very naturally does the following loop: compute output, compute loss, optimization step. Here we need 2x this loop, where the second training step (discriminator) depends on the output of the first. So even if we assume the same optimizer/scheduling for both models (I don't think people do, but allowing different ones would require many more changes, and parameter groups can already do quite a lot), you would still have to either:

  • Proposition 1: put the 2 training loops in a single one (i.e. do output-loss-optim only once)
  • Proposition 2: modify the logic of train_step to sequentially do the generator's output-loss-optim and then the discriminator's output-loss-optim.

Taking from the link posted by @zachbellay, here's essentially the minimum to achieve:

#  Train Generator
# -----------------
optimizer_G.zero_grad()
gen_imgs = generator(noise())
g_pred_fake = discriminator(gen_imgs)
g_loss = generator_loss(g_pred_fake)
g_loss.backward()
optimizer_G.step()

#  Train Discriminator
# ---------------------
optimizer_D.zero_grad()
d_pred_real = discriminator(real_imgs)
d_pred_fake = discriminator(gen_imgs.detach())
d_loss = discriminator_loss(d_pred_real, d_pred_fake)
d_loss.backward()
optimizer_D.step()

Here are the proposition outlines (not working code)

Proposition 1:

# Output
# -------
gen_imgs = generator(noise())
set_requires_grad(discriminator, False)  # the following should not backprop through discriminator
g_pred_fake = discriminator(gen_imgs)
set_requires_grad(discriminator, True) # discriminator back on
d_pred_real = discriminator(real_imgs)
d_pred_fake = discriminator(gen_imgs.detach()) # don't backprop through generator

# Loss
# -------
loss = generator_loss(g_pred_fake) + discriminator_loss(d_pred_real, d_pred_fake)

# Optimize
# -------
loss.backward()
optimizer.step()

This is very easy to do in skorch. Please double-check that the gradients are actually correct (what is backpropagated where), but I think they are. The major issue here is flexibility. Indeed, the theory says (if I'm not mistaken) to 1) do multiple optimization steps of the discriminator, and 2) update the discriminator using the latest generator. Neither is done in the given link (I'm not sure about SOTA GANs), but both would be basically impossible in a single output-loss-optim loop.
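The set_requires_grad helper used in the sketch above is not part of torch or skorch; a minimal version could look like:

```python
from torch import nn

def set_requires_grad(module: nn.Module, flag: bool) -> None:
    # toggle gradient tracking for every parameter of the module,
    # so backprop through the frozen component is skipped
    for p in module.parameters():
        p.requires_grad_(flag)
```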

Proposition 2:

def train_step(self, Xi, yi, **fit_params):
    step_accumulator = self.get_train_step_accumulator()
    def step_fn():
        step = self.train_step_single(Xi, yi, **fit_params)
        step_accumulator.store_step(step) 
        return step['loss']
    
    self.module_.mode = "generator"
    self.criterion_.mode = "generator"
    self.optimizer_.step(step_fn)
    self.optimizer_.zero_grad()
    gen_pred = step_accumulator.get('y_pred')
    Xi = gen_pred  # use generated samples as input to the discriminator

    self.module_.mode = "discriminator"
    self.criterion_.mode = "discriminator"
    for _ in range(k_discriminator_steps):
         self.optimizer_.step(step_fn)
         self.optimizer_.zero_grad()

    return step_accumulator.get_step()

Here the method is theoretically sound and flexible, but not as clean (it basically uses flags to say whether the loss or model should be in discriminator or generator mode). Note that step_accumulator also has to be modified.
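As a sketch of what such a flag-switched criterion could look like (class name and mode attribute are hypothetical, following the Proposition 2 outline above):

```python
import torch
from torch import nn

class GANCriterion(nn.Module):
    """Hypothetical flag-switched loss: the same criterion object computes the
    generator or discriminator loss depending on its mode attribute."""
    def __init__(self):
        super().__init__()
        self.mode = "generator"
        self.bce = nn.BCEWithLogitsLoss()

    def forward(self, pred, target):
        if self.mode == "generator":
            # the generator wants its fakes classified as real (all ones)
            return self.bce(pred, torch.ones_like(pred))
        return self.bce(pred, target)
```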

Let me know what you think and @zachbellay if that would work for GANs as I don't have any experience with those.

@BenjaminBossan
Collaborator

@YannDubs Thank you for the proposal and clean explanation.

Essentially the problem comes from the fact that for GANs you need 2 models, 2 losses, 2 optimizers, and 2 alternating optimization steps (the minimax game does not converge under joint optimization). Concerning models and losses, this is not an issue, as one can combine them in a single module/loss (although this will not save both losses to the history). Optimization is trickier: skorch very naturally does the following loop: compute output, compute loss, optimization step. Here we need 2x this loop, where the second training step (discriminator) depends on the output of the first.

And I wouldn't be surprised if there were applications where even that is not enough.

At the end of the day, I wonder how much sense it makes to "contort" skorch to make it work this way. On the one hand, I'm flattered that people try to use it for even more unconventional cases, on the other it might just not be the best tool (at the moment).

Basically, at the moment, the user would need to implement their own train_step, as you suggest. Of course they can already do that, but it's too easy right now to do it wrong (e.g. forgetting to notify on_grad_computed or not returning the right values in step).

The main challenges I foresee are the logging/callbacks and the get_params/set_params part. The latter could probably be fixed with some improved tooling and documentation. E.g., we already have params_for, which could be used to dispatch to two separate optimizers.
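For reference, params_for essentially filters keyword arguments by a double-underscore prefix and strips it; roughly (a simplified sketch, not skorch's actual implementation):

```python
def params_for(prefix, kwargs):
    # collect kwargs under `<prefix>__` and strip the prefix,
    # e.g. optimizer_G__lr=0.1 -> {'lr': 0.1}
    if not prefix.endswith('__'):
        prefix += '__'
    return {key[len(prefix):]: val
            for key, val in kwargs.items()
            if key.startswith(prefix)}
```

This is what would let a user pass optimizer_G__lr and optimizer_D__lr to one net and dispatch them to two separate optimizers.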

As I said in the other thread, when I have time, I'll try my hands again on the topic. Any kind of feedback, hints, existing repos with concrete implementations, are appreciated.
