Recommended way to use jax.jit for generation from transformers #6242
Unanswered · davisyoshida asked this question in Q&A · 3 comments · 18 replies
-
For now I've decided to convert my JAX weights back to TF after training, as that seems to be the easiest way to get quick generation.
0 replies
-
Perhaps this https://github.com/google/flax/blob/master/examples/lm1b/temperature_sampler.py#L27?
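The trick in that file, roughly, is to preallocate the output buffer at a fixed max length so every iteration of the sampling loop has identical shapes. A minimal sketch of that pattern (`apply_model` here is a hypothetical single-step model call; the real sampler also threads a cache and handles EOS):

```python
import jax
import jax.numpy as jnp
from jax import lax

def sample(apply_model, prompt_token, max_len, rng):
    # Preallocate the full output buffer up front: its shape is (max_len,)
    # on every iteration, so nothing in the loop carry ever changes shape.
    tokens = jnp.zeros(max_len, dtype=jnp.int32).at[0].set(prompt_token)

    def cond(state):
        i, _, _ = state
        return i < max_len - 1

    def body(state):
        i, tokens, rng = state
        rng, sample_rng = jax.random.split(rng)
        # apply_model is assumed to return next-token logits given the
        # buffer and the current position.
        logits = apply_model(tokens, i)
        next_token = jax.random.categorical(sample_rng, logits)
        tokens = tokens.at[i + 1].set(next_token.astype(jnp.int32))
        return i + 1, tokens, rng

    _, tokens, _ = lax.while_loop(cond, body, (0, tokens, rng))
    return tokens
```

Because nothing in the carry ever changes shape, `lax.while_loop` traces the body once and the whole loop compiles a single time.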
13 replies
-
@davisyoshida Based on our discussion, I have some more ideas:
5 replies
-
I've implemented GPT-2 in JAX, but unfortunately generation is currently prohibitively slow: `jax.jit` recompiles at every token during the first run, since a growing set of cached hidden states is passed in each time. Subsequent generations are faster, of course, but on my machine this warmup amounts to almost 2 hours when running one of these models (~6.6 sec/token * 1024 tokens). I was also unable to avoid this using any of the loop constructs or `jax.lax.scan`, since the carry changes shape.
The best solution I've thought of is to select a max length a priori, pad all the cached hidden states up to that length outside of the JIT-ed part of the code, then mask the junk computations that result in the network; a rough sketch of this follows. I'd prefer to avoid doing so, as it both wastes computation and is much less clean.
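For a single attention read, that padding-and-masking workaround looks roughly like this (shapes and names are illustrative, not the actual model code):

```python
import jax
import jax.numpy as jnp

MAX_LEN = 1024  # chosen a priori

@jax.jit
def attend(query, cache, cache_len):
    # query: (d,); cache: (MAX_LEN, d), where only the first cache_len
    # rows hold real hidden states and the rest are padding.
    scores = cache @ query                      # (MAX_LEN,)
    valid = jnp.arange(MAX_LEN) < cache_len     # True for real positions
    scores = jnp.where(valid, scores, -jnp.inf) # mask out the padding
    weights = jax.nn.softmax(scores)            # padding gets zero weight
    return weights @ cache                      # (d,)
```

The input shapes are fixed at (MAX_LEN, d) no matter how many tokens have actually been generated, so this compiles exactly once, at the cost of always computing over all MAX_LEN positions and discarding the junk via the mask.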
In order to make clear exactly what I'm talking about, here's a small example which has a similar issue:
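A minimal sketch of that failure mode (the `step` function here is a hypothetical stand-in for one decoding step, whose cache grows by one element per call):

```python
import time
import jax
import jax.numpy as jnp

@jax.jit
def step(cache):
    # One fake "decoding step": append a new entry to the cache, so the
    # output is one element longer than the input.
    new_entry = jnp.sum(cache, keepdims=True)   # shape (1,)
    return jnp.concatenate([cache, new_entry])

cache = jnp.zeros(1)
for i in range(10):
    start = time.time()
    cache = step(cache)
    cache.block_until_ready()
    # The input shape differs on every call, so jax.jit traces and
    # compiles a fresh program each time; timings stay high throughout.
    print(f"step {i}: {time.time() - start:.3f}s")
```

Each call sees a new input shape, so `jax.jit` retraces and recompiles on every iteration, which is exactly what happens with the growing hidden-state cache.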