TPU memory issues

Atreum · May 30, 2021, 9:55pm

Hi! I’m trying to run a modified version of the Accelerate notebook with the DeBERTa-large and DeBERTa-xlarge models.

Here’s the notebook - Google Colaboratory

I’m experiencing severe memory issues. Even with a batch size of 1, observed memory usage explodes and the process is then killed with SIGKILL.

DeBERTa-base works fine, as do other small models. But going from models in the base range (~500MB) to the large range and beyond (1.5GB+) triggers these memory issues.

This really shouldn’t be happening, since Colab’s TPU has enough memory to handle the 10GB T5-3B. That’s with TensorFlow, of course, but it shows what’s possible with the hardware provided.

I can train DeBERTa-large successfully on CUDA alone. But going beyond large to the 1.7GB xlarge fails there as well.

It should be possible to train all versions of DeBERTa on Colab, as xxlarge is 2.9GB, less than a third of the size of T5-3B.
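To put rough numbers on the sizes above, here is a minimal back-of-the-envelope sketch. It assumes the checkpoint size is about equal to the fp32 weight memory and that training uses an Adam-style optimizer keeping two extra states per parameter; both are my assumptions, not something I have measured. Activations and per-process copies on the host are ignored, so treat the result as a lower bound. The helper name `training_footprint_gb` is made up for illustration.

```python
# Checkpoint sizes in GB as quoted in this post.
CHECKPOINT_GB = {
    "deberta-xlarge": 1.7,
    "deberta-xxlarge": 2.9,
    "t5-3b": 10.0,
}


def training_footprint_gb(checkpoint_gb, optimizer_states=2):
    """Rough lower bound on training memory, in GB.

    Assumption: one copy each of weights and gradients, plus
    `optimizer_states` extra copies (2 for Adam-style optimizers),
    all the same size as the checkpoint. Activations are ignored.
    """
    return checkpoint_gb * (1 + 1 + optimizer_states)


def size_ratio(model, reference):
    """Checkpoint size of `model` relative to `reference`."""
    return CHECKPOINT_GB[model] / CHECKPOINT_GB[reference]


if __name__ == "__main__":
    for name, size in CHECKPOINT_GB.items():
        estimate = training_footprint_gb(size)
        print(f"{name}: checkpoint {size:.1f} GB, training lower bound {estimate:.1f} GB")
    ratio = size_ratio("deberta-xxlarge", "t5-3b")
    print(f"xxlarge / T5-3B checkpoint ratio: {ratio:.2f}")
```

Under these assumptions the 1.7GB xlarge needs about 6.8GB before any activations are counted, which may be part of why it fails on a single Colab GPU. It does not by itself explain the SIGKILL on TPU.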

Can anyone advise on how to overcome these memory issues so that PyTorch-centric models like DeBERTa can be trained on Colab?

Thanks!
