I am trying to pretrain an XLM-RoBERTa model, but I have run into an issue. I trained the model for a few thousand steps and saved a checkpoint. When I tried to continue pre-training from that checkpoint, I got a CUDA "out of memory" error after a few steps. Is there a memory leak or something like that? The first pretraining run used around 15.X GB out of 16.2 GB, so I don't quite understand what's going on.
I'm running into the same issue, but with the mBART model. For some reason, training from scratch with the Seq2SeqTrainer works just fine, but resuming from a checkpoint exceeds the memory limit and produces a CUDA "out of memory" error.
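One possible mechanism (this is my assumption, not a confirmed diagnosis): `torch.load` restores tensors to the device they were saved from, so resuming can allocate the optimizer state on the GPU on top of the already-loaded model, which a from-scratch run never does. A minimal sketch of the difference, using a throwaway file name:

```python
import torch

# Simulate a saved optimizer state (a dict of tensors, as in a real checkpoint).
tensor = torch.randn(4)
torch.save({"exp_avg": tensor}, "optimizer_demo.pt")

# Without map_location, tensors come back on the device they were saved from,
# which on a GPU machine means extra GPU memory at resume time.
# Loading onto the CPU first avoids that allocation spike:
state = torch.load("optimizer_demo.pt", map_location="cpu")
print(state["exp_avg"].device)  # prints: cpu
```

If the resumed run OOMs while the from-scratch run fits, an extra allocation like this at startup would explain it.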
I think it might be related to this issue on the GitHub repository.
@sshleifer I think this is another issue with training large models, as we discussed here, although this one just seems to be a bug in the Trainer.
This fix by @sgugger in a recent commit resolved the issue for me. The fix is not in an official release yet, but you can install the bleeding-edge version of the Transformers library from source with `pip install git+https://github.com/huggingface/transformers`, which pulls it straight from GitHub.
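For anyone copy-pasting, the full sequence I used (the version check afterwards is just to confirm you actually got the source install, not the last release):

```shell
# Install transformers directly from the GitHub main branch
pip install git+https://github.com/huggingface/transformers
# Confirm which version is now active (source installs show a .dev suffix)
python -c "import transformers; print(transformers.__version__)"
```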
Having the same issue here with GPT-2 large. Going to try what @IamAdiSri suggested.
Can confirm that installing the bleeding-edge version of transformers as suggested by @IamAdiSri fixed the issue.