RuntimeError: CUDA out of memory. Tried to allocate 11.53 GiB (GPU 0; 15.90 GiB total capacity; 4.81 GiB already allocated; 8.36 GiB free; 6.67 GiB reserved in total by PyTorch)

After I run trainer.train() and then try to compute the WER of the model, I always get the error above.
How can I solve it?

The code is below:
def predict(batch, model):
    input_dict = processor(batch["input_values"], sampling_rate=16000, return_tensors="pt", padding=True)
    logits = model(input_dict.input_values.to(device)).logits
    pred_ids = torch.argmax(logits, dim=-1)[0]
    batch["pred_ids"] = processor.decode(pred_ids)
    return batch

from transformers import TrainingArguments
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(
    model_name,
    attention_dropout=0.1,
    hidden_dropout=0.1,
    feat_proj_dropout=0.0,
    mask_time_prob=0.05,
    layerdrop=0.1,
    gradient_checkpointing=True,
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)

training_args = TrainingArguments(
    # output_dir="/content/gdrive/MyDrive/wav2vec2-large-xlsr-portuguese-demo/modelo",
    output_dir="./wav2vec2-large-xlsr-portuguese-demo",
    group_by_length=True,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    evaluation_strategy="steps",
    num_train_epochs=5,
    fp16=True,
    save_steps=400,
    eval_steps=400,
    logging_steps=400,
    learning_rate=3e-4,
    warmup_steps=500,
    save_total_limit=2,
)

from transformers import Trainer

trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=d_train,
    eval_dataset=d_val,
    tokenizer=processor.feature_extractor,
)

If you want to access the whole project, it is available at:

Try this:

import gc
import torch

gc.collect()
torch.cuda.empty_cache()
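
For anyone trying the snippet above, a slightly fuller sketch may help show whether it actually frees anything. `free_cuda_memory` is a hypothetical helper name, not part of PyTorch, and the `torch` import is guarded only so the sketch also runs on a machine without PyTorch. Keep in mind that `torch.cuda.empty_cache()` only releases cached blocks that no live tensor still uses, so objects you no longer need (for example the trainer) have to be dropped with `del` first for this to make a difference.

```python
import gc

try:
    import torch  # guarded so the sketch still runs where PyTorch is not installed
except ImportError:
    torch = None


def free_cuda_memory():
    """Collect garbage, release PyTorch's unused cached CUDA blocks, report usage.

    Returns a dict with allocated/reserved memory in GiB, or None when
    PyTorch or a CUDA device is not available.
    """
    gc.collect()  # drop unreachable Python objects that may still hold tensors
    if torch is None or not torch.cuda.is_available():
        return None
    torch.cuda.empty_cache()  # hand cached, unused blocks back to the driver
    gib = 1024 ** 3
    return {
        "allocated_gib": torch.cuda.memory_allocated() / gib,
        "reserved_gib": torch.cuda.memory_reserved() / gib,
    }


stats = free_cuda_memory()
print(stats)
```

If "allocated_gib" stays high after the call, something is still holding references to GPU tensors, and clearing the cache alone will not help.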

It isn’t working :’( and the data is not that large :cry:

I had the same problem, and restarting my notebook kernel helped.
Another time I got this problem, I had another notebook project open; closing it and restarting my "main" notebook helped.

You're running out of GPU memory for whatever reason. You can try making your batch size smaller and using gradient accumulation to compensate. You can also try mixed precision training. The documentation for the Trainer class explains what these terms mean.
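
To make the batch size suggestion concrete, here is a minimal sketch of the arithmetic, assuming a single GPU. The posted configuration already uses `per_device_train_batch_size=16`, `gradient_accumulation_steps=2` and `fp16=True`, so the effective batch size is 32. `accumulation_steps_for` is a hypothetical helper, not a Trainer API; it only computes which `gradient_accumulation_steps` value keeps the effective batch size unchanged when the per-device batch shrinks.

```python
def accumulation_steps_for(effective_batch_size, per_device_batch_size, num_devices=1):
    """Return the gradient_accumulation_steps that preserves the effective batch size."""
    samples_per_step = per_device_batch_size * num_devices
    if effective_batch_size % samples_per_step != 0:
        raise ValueError(
            "effective batch size must be a multiple of "
            "per_device_batch_size * num_devices"
        )
    return effective_batch_size // samples_per_step


# Values taken from the TrainingArguments in the question: 16 samples x 2 steps.
current_effective = 16 * 2

for smaller_batch in (8, 4, 2):
    steps = accumulation_steps_for(current_effective, smaller_batch)
    print(
        f"per_device_train_batch_size={smaller_batch}, "
        f"gradient_accumulation_steps={steps}"
    )
```

Each printed pair can be passed to `TrainingArguments` in place of the original values. Peak memory per forward/backward pass goes down with the smaller batch, while the optimizer still sees 32 samples per update.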
