RuntimeError: CUDA out of memory

Hey guys

I’m currently getting a CUDA out of memory error with the config below, training on 8 x V100 GPUs.

It doesn’t appear immediately, but rather non-deterministically far into the training, which points to a memory leak somewhere. Do you have any tips or ideas on how to approach this?

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./wav2vec2-xlsr-sg-g",
    logging_dir="./logs",
    group_by_length=True,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    evaluation_strategy="steps",
    num_train_epochs=30,
    fp16=False,
    save_steps=400,
    eval_steps=400,
    logging_steps=400,
    learning_rate=3e-4,
    warmup_steps=500,
    save_total_limit=2,
)
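
One idea I had for narrowing it down: log peak GPU memory at every logging step and check whether it creeps up over time. A rough sketch of what I mean (the callback class name is just made up by me):

import torch
from transformers import TrainerCallback

class GpuMemoryLogger(TrainerCallback):
    # Prints the peak GPU memory seen since the last logging step, so a slow
    # climb over time (i.e. a leak) should show up in the training logs.
    def on_log(self, args, state, control, logs=None, **kwargs):
        if torch.cuda.is_available():
            peak_gib = torch.cuda.max_memory_allocated() / 1024**3
            print(f"step {state.global_step}: peak GPU memory {peak_gib:.2f} GiB")
            torch.cuda.reset_peak_memory_stats()

which would then get passed in via Trainer(..., callbacks=[GpuMemoryLogger()]). Would that be a sensible way to tell a gradual leak apart from an occasional oversized batch?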


The non-determinism might arise if your batches aren’t uniformly sized. Without more detail on your training data, that’s just a wild guess.
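
If you want to check, something like this could show whether a few outlier-long clips are to blame (I’m assuming a datasets.Dataset called train_dataset with an "input_values" column holding raw 16 kHz audio, as in the usual wav2vec2 fine-tuning examples):

import numpy as np

# Assumption: train_dataset["input_values"] holds raw 16 kHz audio samples.
lengths = [len(x) for x in train_dataset["input_values"]]

print("max length (samples):", max(lengths))
print("95th percentile:", int(np.percentile(lengths, 95)))
print("clips longer than 30 s:", sum(l > 30 * 16_000 for l in lengths))

With group_by_length=True the longest clips end up batched together, so the batch that finally trips the OOM may only come around occasionally.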

You might try enabling fp16. This will give you a lot more breathing room, even if it doesn’t explain the root cause…
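
That is, just flipping the flag in your existing config, something like:

from transformers import TrainingArguments

# Same setup as above, only with mixed precision enabled; on V100s this
# should roughly halve activation memory and usually speeds things up too.
training_args = TrainingArguments(
    output_dir="./wav2vec2-xlsr-sg-g",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    fp16=True,
)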