Huge disparity between CPU and GPU memory usage?

So I’ve got a T5-large training pipeline running on a CPU-only VM with 16GB of total memory. After tuning gradient_accumulation_steps and enabling gradient_checkpointing=True, I can get training started with a train batch size of 32 and an eval batch size of 32.
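
For context, the relevant bits of my config look roughly like this (a minimal sketch: the output_dir and the gradient_accumulation_steps value are placeholders, not my exact settings):

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="t5-large-finetune",    # placeholder path
    per_device_train_batch_size=32,    # works on the CPU-only VM
    per_device_eval_batch_size=32,
    gradient_accumulation_steps=4,     # placeholder; I tuned this value
    gradient_checkpointing=True,
)
```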

However, on Colab Pro+ with 55GB of system memory and a 16GB V100 GPU, using the same hyperparameters, I can’t even get above a batch size of 2.
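
If it helps, this is roughly how I’ve been checking what the GPU is actually holding (just peak allocated/reserved memory on device 0):

```python
import torch

# After a training step, report peak GPU memory usage (device 0)
print(f"allocated: {torch.cuda.max_memory_allocated(0) / 1e9:.2f} GB")
print(f"reserved:  {torch.cuda.max_memory_reserved(0) / 1e9:.2f} GB")
```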

Any thoughts?