CUDA memory error after resuming training


I am trying to pretrain an XLM-RoBERTa model, but I ran into an issue. I trained the model for a few thousand steps and saved a checkpoint. When I tried to continue pre-training from that checkpoint, I got a “CUDA error memory” error after a few steps. I wonder if there is a memory leak or something like that? The first pretraining run used around 15.X GB of 16.2 GB, so I don’t quite understand what’s going on.
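To tell a gradual leak apart from a one-time jump when the checkpoint is loaded, it may help to log the CUDA allocator statistics every few steps in both the original and the resumed run. Below is a minimal sketch, assuming PyTorch is the backend. `format_gib` and `cuda_memory_summary` are hypothetical helper names made up for illustration, not part of Transformers.

```python
def format_gib(num_bytes):
    """Format a byte count as GiB with two decimals."""
    return f"{num_bytes / 1024**3:.2f} GiB"


def cuda_memory_summary(device=0):
    """Return current, peak, and reserved CUDA memory, or None without a GPU.

    Hypothetical helper: it only reads PyTorch's allocator counters.
    """
    try:
        import torch  # imported lazily so the helper also loads on CPU-only machines
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return {
        "allocated": format_gib(torch.cuda.memory_allocated(device)),
        "peak": format_gib(torch.cuda.max_memory_allocated(device)),
        "reserved": format_gib(torch.cuda.memory_reserved(device)),
    }


if __name__ == "__main__":
    # Call this every N training steps and compare the two runs.
    print(cuda_memory_summary())
```

If "allocated" is already higher right after resuming than it was at the same step in the first run, the extra memory probably comes from how the checkpoint is restored rather than from a leak during training.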


I’m running into the same issue, but with the mBART model. For some reason, training from scratch with the Seq2SeqTrainer works just fine, but resuming from a checkpoint exceeds the memory limit and produces a CUDA ‘out of memory’ error.

I think it might be related to this issue on the GitHub repository.

@sshleifer I think this is another issue with training large models, as we discussed here, although this just seems to be a bug in the trainer.

This fix by @sgugger in a recent commit resolved the issue for me. The fix is not in the official release yet, but you can install the bleeding-edge version of the Transformers library from source using pip install git+, which pulls it from GitHub.
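After installing from source, it can be worth confirming that the environment really picked up the new build and not the older release. Here is a rough sketch using only the standard library. It assumes that a source install of the main branch reports a development version containing a `.dev` segment, which is an assumption about how the package is versioned. Both function names are hypothetical.

```python
from importlib import metadata


def is_dev_build(version):
    """Heuristic: PEP 440 development releases carry a '.dev' segment."""
    return ".dev" in version


def installed_transformers_version():
    """Return the installed transformers version string, or None if absent."""
    try:
        return metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_transformers_version()
    if version is None:
        print("transformers is not installed in this environment")
    else:
        # Assumption: a source install shows up as a development version.
        print(version, "(development build)" if is_dev_build(version) else "(release build)")
```

If this still prints a release version, the source install probably went into a different environment than the one running the training script.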


Having the same issue here with GPT-2 large, gonna try what @IamAdiSri suggested.

Can confirm that installing the bleeding edge version of transformers as suggested by @IamAdiSri fixed the issue.