Huge disparity between CPU and GPU memory usage?

So I’ve got a T5-large training pipeline running on a CPU-only VM with 16GB of total memory. After tuning gradient_accumulation_steps and enabling gradient_checkpointing=True, I can get training started with a train batch size of 32 and an eval batch size of 32.
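
For context, the relevant bits of my config look roughly like this (a minimal sketch: the output_dir and the gradient_accumulation_steps value are placeholders, not my exact settings):

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="t5-large-finetune",    # placeholder path
    per_device_train_batch_size=32,    # works on the CPU-only VM
    per_device_eval_batch_size=32,
    gradient_accumulation_steps=4,     # placeholder; I tuned this value
    gradient_checkpointing=True,
)
```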

However, on Colab Pro+ with 55GB of system memory and a 16GB V100 GPU, using the same hyperparameters, I can’t even get above a batch size of 2.
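
If it helps, this is roughly how I’ve been checking what the GPU is actually holding (just peak allocated/reserved memory on device 0):

```python
import torch

# After a training step, report peak GPU memory usage (device 0)
print(f"allocated: {torch.cuda.max_memory_allocated(0) / 1e9:.2f} GB")
print(f"reserved:  {torch.cuda.max_memory_reserved(0) / 1e9:.2f} GB")
```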

Any thoughts?