Training memory footprint depends on instantiation method

I’ve recently encountered a strange difference in GPU memory footprint during training.

I launched two training setups with identical parameters and random seeds. The only difference was the method used to instantiate the Hugging Face model:

Option 1:

model = ModelClass.from_pretrained(path_to_hub)

Option 2:

config = Config.from_pretrained(path_to_hub)
model = ModelClass(config)
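
For reference, here is a concrete, runnable version of the two options (BertModel and bert-base-uncased are stand-ins for ModelClass and path_to_hub, not the actual model from my setup):

from transformers import BertConfig, BertModel

checkpoint = "bert-base-uncased"  # stand-in for path_to_hub

# Option 1: pretrained weights are loaded from the checkpoint
model_1 = BertModel.from_pretrained(checkpoint)

# Option 2: only the config is fetched; the weights are randomly initialized
config = BertConfig.from_pretrained(checkpoint)
model_2 = BertModel(config)

# Parameter counts and dtypes should match between the two
print(sum(p.numel() for p in model_1.parameters()),
      sum(p.numel() for p in model_2.parameters()))
print(next(model_1.parameters()).dtype, next(model_2.parameters()).dtype)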

Surprisingly, the model initialized with Option 1 (from_pretrained) trained with a larger batch size than the one initialized with Option 2 (from config), which hit a “CUDA out of memory” error when I tried the same larger batch.

I compared the model configurations and the dtype of the weights, but I couldn’t find any differences between the two options.
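
In case it helps with debugging, here is a minimal sketch of one way to compare two instantiations parameter by parameter (diff_models and the model_a/model_b names are placeholders of mine, not part of the original setup):

import torch

def diff_models(model_a: torch.nn.Module, model_b: torch.nn.Module) -> None:
    # Print every parameter whose shape or dtype differs between the two models
    params_a = dict(model_a.named_parameters())
    params_b = dict(model_b.named_parameters())
    assert params_a.keys() == params_b.keys(), "parameter names differ"
    for name, p_a in params_a.items():
        p_b = params_b[name]
        if p_a.shape != p_b.shape or p_a.dtype != p_b.dtype:
            print(name, tuple(p_a.shape), p_a.dtype, "vs", tuple(p_b.shape), p_b.dtype)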

Why am I observing this effect? Is this expected behavior? Could it be hardware-specific or related to CUDA memory allocation? Has anyone else experienced this? I haven’t found any relevant information online.
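
A rough way to narrow down whether the gap comes from the model weights themselves or from allocator behaviour is to snapshot CUDA memory at a few points in each run. This is a generic sketch using PyTorch’s built-in counters, not something from my actual experiments:

import torch

def report_cuda_memory(tag: str) -> None:
    # memory_allocated() counts live tensors; memory_reserved() counts what the caching
    # allocator has claimed from the driver, which is what actually limits batch size
    torch.cuda.synchronize()
    allocated = torch.cuda.memory_allocated() / 2**20
    reserved = torch.cuda.memory_reserved() / 2**20
    print(f"{tag}: {allocated:.1f} MiB allocated, {reserved:.1f} MiB reserved")

Calling this right after moving the model to the GPU and again after the first optimizer step in both setups should show where the two runs start to diverge.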

P.S.

All training experiments (including all GPU communication) were based on PyTorch Lightning, which might be a relevant detail.


I haven’t run into the same issue myself, but I have a guess.
What I’d suspect in a case like this is torch and the accelerate library.
It’s possible that accelerate is handling RAM and VRAM allocation efficiently, and that torch is using meta tensors under the hood.
A worse possibility would be a bug in from_pretrained() or in the config-based initialization path in the transformers library…
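
If the meta-tensor path is involved, here is a hedged sketch of the mechanism I mean, with BertModel and bert-base-uncased as placeholders: from_pretrained() can build the module skeleton on the meta device (via accelerate) and only materialize real tensors when the checkpoint weights are loaded, while ModelClass(config) always runs the full random initialization. Whether this explains a VRAM difference is only a guess, since it mainly affects CPU RAM during loading.

from accelerate import init_empty_weights
from transformers import BertConfig, BertModel

# from_pretrained() with low_cpu_mem_usage=True takes the accelerate/meta-tensor path
model_loaded = BertModel.from_pretrained("bert-base-uncased", low_cpu_mem_usage=True)

# Roughly what that path does under the hood: build the module on the meta device,
# so no real storage is allocated until weights are loaded into it
config = BertConfig.from_pretrained("bert-base-uncased")
with init_empty_weights():
    model_empty = BertModel(config)  # parameters live on the meta device

print(next(model_empty.parameters()).device)  # meta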
