I'm computing the Jacobian of a neural network function with respect to its parameters for multiple inputs using Flax/JAX: `f(params, input) -> output`. I tried two approaches to calculate the Jacobian:

Jacobian after batching: `jit(jacrev(vmap(f, in_axes=(None, 0)), argnums=0))(params, inputs)`

Batching after Jacobian: `jit(vmap(jacrev(f, argnums=0), in_axes=(None, 0)))(params, inputs)`

Question: I expected XLA to optimize both approaches similarly, but the first one seems much more memory-intensive (I am assuming it's a bug, based on my naive understanding of XLA and JIT). Any insights into why this happens would be greatly appreciated.
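For concreteness, here is a minimal self-contained sketch of the setup I mean, with a toy `flax.linen.Dense` layer standing in for the actual network (the model, sizes, and names are illustrative, not my real code):

```python
import jax
import jax.numpy as jnp
from jax import jit, jacrev, vmap
import flax.linen as nn

# Toy stand-in for the real network: a single Dense layer.
model = nn.Dense(features=4)
inputs = jnp.ones((32, 8))  # batch of 32 inputs, each of dimension 8
params = model.init(jax.random.PRNGKey(0), inputs[0])

def f(params, x):
    # f(params, input) -> output for a single (unbatched) input
    return model.apply(params, x)

# Approach 1: Jacobian after batching
jac1 = jit(jacrev(vmap(f, in_axes=(None, 0)), argnums=0))(params, inputs)

# Approach 2: batching after Jacobian
jac2 = jit(vmap(jacrev(f, argnums=0), in_axes=(None, 0)))(params, inputs)

# Both return a pytree of per-example Jacobians with the same leaf shapes,
# e.g. (32, 4, 8, 4) for the Dense kernel, but their memory behavior differs.
```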
Replies: 1 comment 1 reply
One way to see what's happening is to look at the `jaxpr` for your computation. This may get pretty complicated in practice, but let's look at a very simple example:

import jax.numpy as jnp
from jax import jit, jacrev, vmap
def f(params, input):
return params * input
params = 2.0
inputs = jnp.arange(1000.0)
print('jacrev(vmap(f)):')
print(jit(jacrev(vmap(f, in_axes=(None, 0)), argnums=0)).trace(params, inputs).jaxpr)
print()
print('vmap(jacrev(f)):')
print(jit(vmap(jacrev(f, argnums=0), in_axes=(None, 0))).trace(params, inputs).jaxpr)
Notice the two are almost identical, except for the fact that the first one operates on 1000×1000 arrays, while the second only ever works with arrays of length 1000. The reason for this is that the Jacobian of a function with a length-1000 output is computed in reverse mode by pulling back each of the 1000 standard output basis vectors, i.e. by pushing a 1000×1000 identity matrix through the backward pass; since that matrix is diagonal, nearly all of this work amounts to multiplying by zero. `vmap(jacrev(f))` avoids this because each per-example Jacobian has a scalar output, so its cotangent is just a scalar. Note that this kind of optimization (recognizing that a matrix is diagonal and rewriting the computation to account for that) is not something XLA's compiler will do automatically.
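To make the difference concrete, you can compare the largest intermediate array that appears in each jaxpr. The snippet below repeats the example above so it is self-contained; note that it pokes at jaxpr internals (`eqns`, `outvars`, `aval`), which are not a stable public API, so treat it as a rough diagnostic rather than a guaranteed recipe:

```python
import jax.numpy as jnp
from jax import jit, jacrev, vmap

def f(params, input):
    return params * input

params = 2.0
inputs = jnp.arange(1000.0)

jac_after_batch = jit(jacrev(vmap(f, in_axes=(None, 0)), argnums=0))
batch_after_jac = jit(vmap(jacrev(f, argnums=0), in_axes=(None, 0)))

# The two compute the same values: d(params * x_i)/d(params) = x_i.
assert jnp.allclose(jac_after_batch(params, inputs),
                    batch_after_jac(params, inputs))

def largest_intermediate(traced):
    # Size (in elements) of the biggest array produced by any equation in the jaxpr.
    return max(v.aval.size for eqn in traced.jaxpr.eqns for v in eqn.outvars)

print(largest_intermediate(jac_after_batch.trace(params, inputs)))  # expect ~1000*1000
print(largest_intermediate(batch_after_jac.trace(params, inputs)))  # expect ~1000
```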