Description
Hello,
I would appreciate any advice/tips on this.
A common use case these days is to take speech/vision embeddings from model A and combine them with another model B. When the models and datasets are large, pre-computing the embeddings and writing them to disk is very slow. Currently I do this in shards: I pre-compute the embeddings offline, save them in a dataset, and later use them during model training.
An alternative is a streaming approach: generate the embeddings during preprocessing or in the data collator. However, I keep getting an error that the input_ids and the weights are on cpu and cuda:0, and when I move the inputs to the device I get the following error instead:
Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
So how can one do GPU computation during preprocessing? I need the embeddings from both model A and model B to prepare my final input_embeds batch.
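For what it's worth, the error message itself points at one workaround: use the "spawn" start method so that worker processes start with a fresh interpreter instead of inheriting the parent's CUDA-initialized state via fork. A minimal stdlib sketch of that idea (the doubling worker is a stand-in for the real embedding computation):

```python
import multiprocessing as mp

def worker(x):
    # In the real pipeline this is where the CUDA model would run.
    # Under "spawn" the child process gets a fresh interpreter, so CUDA
    # can be initialized safely there (unlike under "fork").
    return x * 2

if __name__ == "__main__":
    # "spawn" start method: children do not inherit the parent's
    # (possibly CUDA-initialized) state, which avoids the
    # "Cannot re-initialize CUDA in forked subprocess" error.
    ctx = mp.get_context("spawn")
    with ctx.Pool(2) as pool:
        print(pool.map(worker, [1, 2, 3]))  # [2, 4, 6]
```

With a PyTorch DataLoader, the equivalent is passing `multiprocessing_context="spawn"` to the DataLoader (or calling `torch.multiprocessing.set_start_method("spawn")` early in the program). Note that spawn has a higher per-worker startup cost, and each worker holding its own copy of the embedding models on the GPU can be expensive memory-wise.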
The only other approach I can see is to do this inside the model code itself, but I would appreciate any other tips/advice.
Thank you!!