Wav2Vec2 CUDA memory usage doubled in v4.11.3 compared to v4.10.3 with the same batch size #14388

@MarktHart

Environment info

  • transformers version: 4.11.3
  • Platform: Linux-5.11.0-40-generic-x86_64-with-glibc2.29
  • Python version: 3.8.10
  • PyTorch version (GPU?): 1.8.1+cu111 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes, 3090
  • Using distributed or parallel set-up in script?: No

Who can help

@patrickvonplaten, @anton-l

Information

When using Wav2Vec2, CUDA memory usage roughly doubles when going from Transformers v4.10.3 to v4.11.3.
Whereas my 3090 (24 GB) could handle a batch size of ~32 on v4.10.3, on v4.11.3 this is reduced to ~10.

The problem arises when using:

  • my own modified scripts

The task I am working on is:

  • ASR

To reproduce

Steps to reproduce the behavior:

  1. Run the script with v4.10.3 and v4.11.3 and compare CUDA memory usage (a small measurement sketch follows below)
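
To make the comparison concrete, peak memory can be printed at the end of the run. This is a minimal sketch assuming a single GPU; the report_peak_memory helper is only for illustration and is not part of the original script:

import torch

def report_peak_memory(tag):
    # Peak CUDA memory allocated by tensors since the start of the process, in GiB.
    # Run the same training script under v4.10.3 and v4.11.3 and compare the values.
    peak_gib = torch.cuda.max_memory_allocated() / 1024 ** 3
    print(f"[{tag}] peak CUDA memory allocated: {peak_gib:.2f} GiB")

# e.g. call report_peak_memory("4.11.3") once after trainer.train() returns
# (or after interrupting training after a few steps).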

Reproduce script (relatively minimal):

from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor, TrainingArguments
from transformers.trainer import Trainer
from torch.utils.data.dataset import Dataset
import numpy as np

class ProcessedDataset(Dataset):
    def __init__(self, processor):
        self.processor = processor

    def __getitem__(self, i):
        x = np.ones(16000 * 10) # 10 seconds
        y = "this is a random sentence"
        with self.processor.as_target_processor():
            batch = {"labels": self.processor(y).input_ids}
        batch["input_values"] = self.processor(x, sampling_rate=16000).input_values
        return batch

    def __len__(self):
        return 10000

class DataCollator:
    def __init__(self, processor):
        self.processor = processor

    def __call__(self, features):
        input_features = [{"input_values": feature["input_values"][0]} for feature in features]
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        batch = self.processor.pad(
            input_features,
            padding=True,
            max_length=None,
            pad_to_multiple_of=None,
            return_tensors="pt",
        )
        with self.processor.as_target_processor():
            labels_batch = self.processor.pad(
                label_features,
                padding=True,
                max_length=None,
                pad_to_multiple_of=None,
                return_tensors="pt",
            )
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)
        batch["labels"] = labels
        return batch


proc = Wav2Vec2Processor.from_pretrained("wietsedv/wav2vec2-large-xlsr-53-dutch")
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-nl-voxpopuli",
    attention_dropout=0,
    hidden_dropout=0,
    feat_proj_dropout=0,
    mask_time_prob=0,
    layerdrop=0,
    activation_dropout=0,
    gradient_checkpointing=True,
    ctc_loss_reduction="mean",
    pad_token_id=proc.tokenizer.pad_token_id,
    vocab_size=len(proc.tokenizer),
    ctc_zero_infinity=True
)
ds = ProcessedDataset(proc)
data_collator = DataCollator(processor=proc)
args = TrainingArguments(
    output_dir="/tmp/tmp_model",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    do_eval=False,
    num_train_epochs=1,
    fp16=True,
    group_by_length=False,
    save_steps=-1,
    eval_steps=1024,
    logging_steps=1024,
    warmup_steps=128,
    save_total_limit=1,
    dataloader_num_workers=1,
    seed=11
)

trainer = Trainer(model=model, args=args, train_dataset=ds, data_collator=data_collator)
trainer.train()
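
I am not sure whether this is related, but the script passes gradient_checkpointing=True to from_pretrained, and as far as I can tell v4.11 deprecates that config-level flag in favour of an explicit method on the model. As an assumption worth checking, a variant of the model construction that enables checkpointing explicitly (the method only exists on v4.11+) could confirm whether checkpointing is still active after the upgrade:

# Hypothetical variant of the model construction above: enable gradient
# checkpointing through the explicit API added in v4.11 instead of the
# (deprecated) config argument passed to from_pretrained.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-nl-voxpopuli",
    ctc_loss_reduction="mean",
    ctc_zero_infinity=True,
    pad_token_id=proc.tokenizer.pad_token_id,
    vocab_size=len(proc.tokenizer),
)
model.gradient_checkpointing_enable()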

Expected behavior

Upgrading Hugging Face Transformers from v4.10 to a later version should keep memory usage in the same ballpark.
