Is there a reference for generic best practices for data loading? I know this is a very generic question, but I've noticed that because Jax doesn't have pre-built dataloaders, I end up using some combination of Tensorflow and Jax, and I'm never sure whether what I'm doing is optimal. I think there are three questions in particular that would be good to have documented somewhere (I've read Jax's memory model documentation page, but still wasn't sure):
1. What is the recommended way to get batches out of a `tf.data` pipeline as Jax arrays? I've seen `ds.map(lambda x: jnp.array(x))`, but I believe people have had performance issues there. I've also seen it done using `dlpack` as a stepping stone (see https://stackoverflow.com/questions/69782818/turn-a-tf-data-dataset-to-a-jax-numpy-iterator). (First sketch after this list.)
2. How should device placement be handled? In Tensorflow there is `tf.device`, and in Jax, we have methods like `jax.device_put`. Is the correct pattern to load the dataset using Tensorflow onto the CPU, then use Jax on the GPU, e.g. something like `device_put_sharded`? (Second sketch below.)
3. Which devices should Jax actually see? In particular, often I'm on a cloud instance with a GPU, but `jax.devices()` only shows the GPU. `jax.devices('cpu')`, however, will show a CPU; despite this, any of the "sharding" calls will throw an error because there's only one device in `jax.devices()`, when I would expect there to be two. (Third sketch below.)
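On question 1, a minimal sketch of the route that skips `jnp.array` inside `ds.map` entirely: iterate the pipeline as NumPy and let the jitted step do the device transfer. The toy dataset, shapes, and `step` function are made up for illustration.

```python
import jax
import tensorflow as tf

# Toy host-side pipeline; the data and shapes are placeholders.
ds = tf.data.Dataset.from_tensor_slices(tf.ones((128, 8))).batch(32)

@jax.jit
def step(x):
    # x arrives as a plain NumPy array; calling the jitted function
    # copies it to the default device (the GPU, if one is visible).
    return (x * 2.0).sum()

# as_numpy_iterator() yields NumPy arrays, so no jnp.array call is
# needed inside ds.map at all.
for batch in ds.as_numpy_iterator():
    out = step(batch)
```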
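On question 2, a sketch of the "Tensorflow on the host, Jax on the accelerator" pattern. Hiding the GPU from Tensorflow is an assumption about the setup (it keeps `tf.data` from reserving GPU memory that Jax's allocator also wants), not something either library requires.

```python
import jax
import numpy as np
import tensorflow as tf

# Pin tf.data to the host by hiding the GPU from TensorFlow.
tf.config.set_visible_devices([], "GPU")

ds = tf.data.Dataset.from_tensor_slices(np.ones((128, 8), np.float32)).batch(32)

device = jax.devices()[0]  # on a single-GPU instance, the GPU
for batch in ds.as_numpy_iterator():
    x = jax.device_put(batch, device)  # explicit host -> device copy
    # ... compute on x with jitted Jax code ...
```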
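On question 3, a sketch of why the sharding helpers complain: `jax.devices()` lists only the default backend, and `jax.device_put_sharded` counts devices within that single backend, so the CPU backend's device never makes it "two devices". The batch shape here is arbitrary and assumed to divide evenly across devices.

```python
import jax
import numpy as np

print(jax.devices())        # default backend only, e.g. [cuda(id=0)]
print(jax.devices("cpu"))   # the host CPU lives on a separate backend

devices = jax.devices()
batch = np.ones((32, 8), np.float32)
shards = np.split(batch, len(devices))  # one shard per visible device

# With a single GPU this is a one-element "sharding"; asking for more
# shards than len(jax.devices()) is what raises the error.
sharded = jax.device_put_sharded(shards, devices)
print(sharded.shape)  # shards stacked along a new leading device axis
```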
Replies: 1 comment

Whilst I don't really have the answers to these, this thread is also relevant/interesting: