torch.cuda.OutOfMemoryError

Hi all, I keep getting this error while running transformers train.train():

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 GiB (GPU 0; 39.50 GiB total capacity; 38.72 GiB already allocated; 225.12 MiB free; 38.72 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
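For anyone reading the numbers in that message, here is a quick sketch of the arithmetic. The variable names are just labels chosen for illustration, not PyTorch identifiers. Reserved and allocated memory are equal, so the "reserved >> allocated" fragmentation hint does not seem to apply to this particular failure. The request is simply much larger than what is left.

```python
# Figures copied from the error message (GiB unless noted).
requested_gib = 16.00
total_gib = 39.50
allocated_gib = 38.72
reserved_gib = 38.72
free_mib = 225.12

# Memory PyTorch holds in its cache but is not using for live tensors.
cached_unused_gib = reserved_gib - allocated_gib

# How far the request exceeds what is reported as free.
shortfall_gib = requested_gib - free_mib / 1024

print(f"cached but unused: {cached_unused_gib:.2f} GiB")  # 0.00 GiB
print(f"shortfall: {shortfall_gib:.2f} GiB")              # 15.78 GiB
```

With nothing cached and unused, allocator tuning has little to reclaim. Under that reading, the live tensors themselves would have to shrink before a 16 GiB allocation can fit.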

Following some advice online, I tried setting PYTORCH_CUDA_ALLOC_CONF to “garbage_collection_threshold:0.6,max_split_size_mb:128” and also adding

torch.cuda.empty_cache()

to my code but that doesn’t help.
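For reference, a minimal sketch of how those two settings are usually applied. It assumes the variable is set in-process before CUDA is initialized (setting it before importing torch is the safe choice). The helper name `clear_and_report` is hypothetical, and the sketch falls back gracefully when torch or a GPU is not available.

```python
import os

# Set the allocator config before torch initializes CUDA; doing it
# before the import avoids any ordering surprises.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = (
    "garbage_collection_threshold:0.6,max_split_size_mb:128"
)

try:
    import torch
except ImportError:  # allows the sketch to run where torch is absent
    torch = None


def clear_and_report():
    """Release unused cached blocks, then return (allocated, reserved) in bytes.

    Returns None when torch or a CUDA device is not available.
    """
    if torch is None or not torch.cuda.is_available():
        return None
    # empty_cache() only frees cached blocks that no live tensor is using;
    # it cannot free memory held by tensors that are still referenced.
    torch.cuda.empty_cache()
    return torch.cuda.memory_allocated(), torch.cuda.memory_reserved()


stats = clear_and_report()
if stats is not None:
    allocated, reserved = stats
    print(f"allocated: {allocated / 2**30:.2f} GiB, reserved: {reserved / 2**30:.2f} GiB")
```

Calling this just before the training step that fails shows whether the memory is held by live tensors (allocated close to reserved) or sitting unused in the cache.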

So any ideas? My GPU is an A100 with 40 GB of memory, and I use CUDA 11.4 and torch 2.0.1.

Thanks,

Oren
