This paper:
https://arxiv.org/pdf/1511.04587.pdf
... suggests "adjustable gradient clipping", which seems to greatly help in training deep networks quickly and efficiently. Basically, they propose scaling the gradient clipping threshold by 1 / learning rate, i.e., clipping the gradients to [-clip_gradients / current_lr, +clip_gradients / current_lr].
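For clarity, here is a minimal sketch of that rule. This is illustrative only, not Caffe's actual `ClipGradients` implementation; the function name `AdjustableClipGradients` and the parameters `clip_gradients` / `current_lr` are placeholders for whatever the solver would actually use:

```cpp
#include <algorithm>
#include <vector>

// Illustrative sketch (not Caffe code): element-wise "adjustable" gradient
// clipping as described in the paper. The fixed threshold clip_gradients is
// divided by the current learning rate, so the clipping range grows as the
// learning rate decays.
void AdjustableClipGradients(std::vector<float>& diffs,
                             float clip_gradients,
                             float current_lr) {
  const float limit = clip_gradients / current_lr;
  for (float& g : diffs) {
    g = std::max(-limit, std::min(limit, g));
  }
}
```

The point of the scaling is that the effective parameter update, current_lr * g, then stays bounded by clip_gradients no matter how the learning rate is scheduled, which is what allows the paper to train with very high initial learning rates.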
As far as I can see, Caffe doesn't support this yet, correct? Might be useful to add?