Advice with GPU computations in collator/pre-process #3691

@saikoneru

Description

Hello,

I would appreciate any advice/tips on this.

A common use case these days is to take speech/vision embeddings from model A and join them with another model B. When the models and datasets are large, pre-computing the embeddings and writing them to disk is very slow. Currently I do this in shards: I pre-compute the embeddings offline, save them in a dataset, and use them later during model training.

However, an alternative is a streaming approach: generate the embeddings during preprocessing or in the data collator. When I try this, I keep getting an error that the input_ids and the weights are on cpu and cuda:0 respectively. When I move the inputs to the device, I get the following error instead:

Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
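For reference, here is a minimal sketch of the workaround the error message itself points at: telling the DataLoader to spawn its workers instead of forking them, so each worker starts as a fresh process that is free to initialize CUDA. Everything here is a toy stand-in (the collator just stacks integers); workers are not started until iteration, so the construction itself is cheap.

```python
import torch
from torch.utils.data import DataLoader

def collate_fn(batch):
    # GPU work would be legal here under 'spawn': each spawned worker
    # initializes CUDA from scratch instead of inheriting a forked state.
    return torch.tensor(batch)

dataset = list(range(32))

# multiprocessing_context="spawn" starts workers as fresh processes rather
# than fork()ing the parent, avoiding the CUDA re-initialization error.
loader = DataLoader(dataset, batch_size=8, num_workers=2,
                    multiprocessing_context="spawn",
                    collate_fn=collate_fn)
```

The catch, as I understand it, is that each spawned worker would need its own copy of model A on the GPU, and the resulting CUDA tensors still have to be shipped back across process boundaries.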

Now, how can one do GPU computation in the pre-processing step? I need to get the embeddings from both model A and model B to prepare my final input_embeds batch.
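The simplest version I can think of keeps the collator in the main process (num_workers=0), where CUDA is already initialized, so the fork issue never arises. A toy sketch with a small nn.Embedding standing in for model A (all names are illustrative):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader

# Stand-in for model A (e.g. a speech/vision encoder); frozen, on GPU if available.
device = "cuda" if torch.cuda.is_available() else "cpu"
encoder = nn.Embedding(100, 16).to(device).eval()

def collate_fn(batch):
    # batch: list of dicts with token ids; move to the device up front.
    input_ids = torch.tensor([ex["input_ids"] for ex in batch], device=device)
    with torch.no_grad():
        embeds = encoder(input_ids)  # model A forward runs on the device
    return {"inputs_embeds": embeds}

dataset = [{"input_ids": [1, 2, 3]}, {"input_ids": [4, 5, 6]}]
# num_workers=0 keeps the collator in the main process, so no forked
# subprocess ever touches CUDA.
loader = DataLoader(dataset, batch_size=2, collate_fn=collate_fn, num_workers=0)
batch = next(iter(loader))
print(batch["inputs_embeds"].shape)  # torch.Size([2, 3, 16])
```

The obvious downside is that data loading no longer overlaps with training, since the collator now competes with the training step for the main process and the GPU.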

The only approach I can see is to do this inside the model code, but I would appreciate any other tips/advice.
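Concretely, by "inside the model code" I mean something like the following wrapper, where a frozen model A produces the embeddings on the fly inside forward and feeds them to model B on the same device (toy modules standing in for both):

```python
import torch
from torch import nn

class JointModel(nn.Module):
    """Hypothetical wrapper: frozen encoder A feeds model B inside forward."""
    def __init__(self, encoder_a, model_b):
        super().__init__()
        self.encoder_a = encoder_a.eval()
        for p in self.encoder_a.parameters():
            p.requires_grad_(False)  # A is frozen; only B trains
        self.model_b = model_b

    def forward(self, input_ids):
        with torch.no_grad():
            embeds = self.encoder_a(input_ids)  # model A on the same device
        return self.model_b(embeds)             # model B consumes the embeddings

# Toy stand-ins for A and B.
model = JointModel(nn.Embedding(100, 16), nn.Linear(16, 4))
out = model(torch.tensor([[1, 2, 3]]))
print(out.shape)  # torch.Size([1, 3, 4])
```

This sidesteps the multiprocessing problem entirely, since both forwards happen in the training step, but it re-runs model A every epoch instead of reusing cached embeddings.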

Thank you!!
