run_mlm.py CUDA out of memory error after resuming training

I’m running into the same issue, but with the mBART model. For some reason, training from scratch with the Seq2SeqTrainer works just fine, but resuming from a checkpoint exceeds the memory limit and produces a CUDA ‘out of memory’ error.
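Roughly, this is how I’m invoking it (model name, paths, and hyperparameters below are placeholders, not my exact config):

```python
from transformers import (
    MBartForConditionalGeneration,
    MBartTokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")
tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-cc25")

training_args = Seq2SeqTrainingArguments(
    output_dir="./mbart-output",
    per_device_train_batch_size=4,
    num_train_epochs=3,
    save_steps=500,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # my tokenized dataset
    tokenizer=tokenizer,
)

# Training from scratch fits in GPU memory:
# trainer.train()

# Resuming from a saved checkpoint is what triggers the CUDA OOM:
trainer.train(resume_from_checkpoint="./mbart-output/checkpoint-500")
```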

I think it might be related to this issue on the GitHub repository.

@sshleifer I think this is another issue with training large models, as we discussed here, although this one just seems to be a bug in the Trainer.
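As a workaround I’m considering (untested), reloading only the model weights from the checkpoint and starting a fresh Trainer should sidestep loading the saved optimizer state onto the GPU, at the cost of losing the optimizer and scheduler state:

```python
from transformers import MBartForConditionalGeneration, Seq2SeqTrainer

# Load just the model weights from the checkpoint directory (path is a placeholder).
model = MBartForConditionalGeneration.from_pretrained("./mbart-output/checkpoint-500")

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,         # same training arguments as before
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)

# Fresh run, no resume_from_checkpoint, so the optimizer state is rebuilt from scratch.
trainer.train()
```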