RuntimeError: CUDA out of memory

Hey guys

I’m currently getting a CUDA out of memory error with the config below, training on 8 x V100 GPUs.

It doesn’t appear immediately, but rather non-deterministically far into the training, which points to a memory leak somewhere. Do you have any tips or ideas on how to approach this?

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./wav2vec2-xlsr-sg-g",
    logging_dir="./logs",
    group_by_length=True,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    evaluation_strategy="steps",
    num_train_epochs=30,
    fp16=False,
    save_steps=400,
    eval_steps=400,
    logging_steps=400,
    learning_rate=3e-4,
    warmup_steps=500,
    save_total_limit=2,
)
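
One idea I had for narrowing it down: log peak GPU memory at every logging step and check whether it creeps up over time. A rough sketch of what I mean (the callback class name is just made up by me):

import torch
from transformers import TrainerCallback

class GpuMemoryLogger(TrainerCallback):
    # Prints the peak GPU memory seen since the last logging step, so a slow
    # climb over time (i.e. a leak) should show up in the training logs.
    def on_log(self, args, state, control, logs=None, **kwargs):
        if torch.cuda.is_available():
            peak_gib = torch.cuda.max_memory_allocated() / 1024**3
            print(f"step {state.global_step}: peak GPU memory {peak_gib:.2f} GiB")
            torch.cuda.reset_peak_memory_stats()

which would then get passed in via Trainer(..., callbacks=[GpuMemoryLogger()]). Would that be a sensible way to tell a gradual leak apart from an occasional oversized batch?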


The non-determinism might arise if your batches aren’t uniformly sized. Without more detail on your training data, that’s just a wild guess.
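
If you want to check, something like this could show whether a few outlier-long clips are to blame (I’m assuming a datasets.Dataset called train_dataset with an "input_values" column holding raw 16 kHz audio, as in the usual wav2vec2 fine-tuning examples):

import numpy as np

# Assumption: train_dataset["input_values"] holds raw 16 kHz audio samples.
lengths = [len(x) for x in train_dataset["input_values"]]

print("max length (samples):", max(lengths))
print("95th percentile:", int(np.percentile(lengths, 95)))
print("clips longer than 30 s:", sum(l > 30 * 16_000 for l in lengths))

With group_by_length=True the longest clips end up batched together, so the batch that finally trips the OOM may only come around occasionally.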

You might try enabling fp16. This will give you a lot more breathing room, even if it doesn’t explain the root cause…
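
That is, just flipping the flag in your existing config, something like:

from transformers import TrainingArguments

# Same setup as above, only with mixed precision enabled; on V100s this
# should roughly halve activation memory and usually speeds things up too.
training_args = TrainingArguments(
    output_dir="./wav2vec2-xlsr-sg-g",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    fp16=True,
)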