TPU memory issues

Hi! I’m trying to run a modified version of the Accelerate notebook with the DeBERTa-large and DeBERTa-xlarge models.

Here’s the notebook - Google Colaboratory

I’m experiencing severe memory issues. Even with a batch size of 1, memory usage explodes and the process is eventually killed with SIGKILL.
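For reference, the training part of the notebook boils down to roughly the standard Accelerate loop below. This is just a minimal sketch of the setup; the model name, dataset, and hyperparameters here are illustrative placeholders rather than my exact notebook code.

```python
import torch
from torch.utils.data import DataLoader
from accelerate import Accelerator
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Illustrative checkpoint; the same loop is used for -large and -xlarge.
model_name = "microsoft/deberta-large"

accelerator = Accelerator()
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# `train_dataset` is assumed to be a tokenized dataset whose batches contain
# input_ids, attention_mask, and labels tensors.
train_loader = DataLoader(train_dataset, batch_size=1, shuffle=True)

# Accelerate moves everything to the TPU/GPU device and wraps the dataloader.
model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)

model.train()
for batch in train_loader:
    outputs = model(**batch)
    loss = outputs.loss
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()
```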

DeBERTa-base works fine, as do other small models. But going up from models in the base range (~500 MB) to the large range and beyond (1.5 GB+) triggers these memory issues.

This really shouldn’t be happening, as Colab’s TPU has enough memory to handle the ~10 GB T5-3B. That’s with TensorFlow of course, but it shows what the provided hardware is capable of.

On CUDA I can actually train DeBERTa-large successfully, but going beyond that to the 1.7 GB xlarge fails there as well.

It should be possible to train all versions of DeBERTa on Colab, as even xxlarge is 2.9 GB, only about a third of the size of T5-3B.

Can anyone advise on how to overcome these memory issues so that PyTorch models like DeBERTa can be trained on Colab?

Thanks!
