I am new to deep learning and I am training my first XLM-RoBERTa (base)-style language model using the Trainer API and TPUs. In my case, the Trainer API is slightly customized to incorporate a batch sampler. I started training the language model on Google Colab and everything worked fine: RAM usage never exceeded 8 or 9 GB. Over time, though, the usage started to grow significantly. On the last training run, resuming from the last checkpoint, it required around 55 GB of RAM, while today it required 34 GB.
I don’t even know whether this is normal, but to my understanding it is not normal at all. The problem is that I don’t know how to troubleshoot it. Can anybody please guide me on how to solve this?
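If it helps to narrow things down, here is a minimal sketch of the kind of per-step memory logging I could add to the training loop (the `peak_ram_gb` helper name is my own, and it uses only the standard-library `resource` module, which reports peak resident memory on Linux, where `ru_maxrss` is in kilobytes):

```python
import resource


def peak_ram_gb():
    """Return this process's peak resident memory in GB.

    On Linux, ru_maxrss is reported in kilobytes, so dividing by
    1024**2 converts it to gigabytes.
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024 ** 2


# Printing this every few hundred training steps would show whether memory
# grows steadily (suggesting something is kept alive across steps, e.g.
# accumulated tensors or logs) or jumps only around checkpoint saves/loads.
print(f"peak RSS so far: {peak_ram_gb():.2f} GB")
```

My thinking is that if the logged value climbs monotonically with the step count, that would point at a leak in the custom batch sampler or the training loop rather than normal checkpoint overhead.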