Training memory footprint depends on instantiation method

I’ve recently encountered a strange difference in GPU memory footprint during training.

I launched two training setups with identical parameters and random seeds. The only difference was the method used to instantiate the Hugging Face model:

Option 1:

model = ModelClass.from_pretrained(path_to_hub)

Option 2:

config = Config.from_pretrained(path_to_hub)
model = ModelClass(config)
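
For reference, here is a concrete, runnable version of the two options (BertModel and bert-base-uncased are stand-ins for ModelClass and path_to_hub, not the actual model from my setup):

from transformers import BertConfig, BertModel

checkpoint = "bert-base-uncased"  # stand-in for path_to_hub

# Option 1: pretrained weights are loaded from the checkpoint
model_1 = BertModel.from_pretrained(checkpoint)

# Option 2: only the config is fetched; the weights are randomly initialized
config = BertConfig.from_pretrained(checkpoint)
model_2 = BertModel(config)

# Parameter counts and dtypes should match between the two
print(sum(p.numel() for p in model_1.parameters()),
      sum(p.numel() for p in model_2.parameters()))
print(next(model_1.parameters()).dtype, next(model_2.parameters()).dtype)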

Surprisingly, the model initialized with Option 1 (from_pretrained) trained with a larger batch size than the one initialized with Option 2 (from config), which hit a “CUDA out of memory” error when I tried the same larger batch.

I compared the model configurations and the dtype of the weights, but I couldn’t find any differences between the two options.
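
In case it helps with debugging, here is a minimal sketch of one way to compare two instantiations parameter by parameter (diff_models and the model_a/model_b names are placeholders of mine, not part of the original setup):

import torch

def diff_models(model_a: torch.nn.Module, model_b: torch.nn.Module) -> None:
    # Print every parameter whose shape or dtype differs between the two models
    params_a = dict(model_a.named_parameters())
    params_b = dict(model_b.named_parameters())
    assert params_a.keys() == params_b.keys(), "parameter names differ"
    for name, p_a in params_a.items():
        p_b = params_b[name]
        if p_a.shape != p_b.shape or p_a.dtype != p_b.dtype:
            print(name, tuple(p_a.shape), p_a.dtype, "vs", tuple(p_b.shape), p_b.dtype)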

Why am I observing this effect? Is this expected behavior? Could it be hardware-specific or related to CUDA memory allocation? Has anyone else experienced this? I haven’t found any relevant information online.
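
A rough way to narrow down whether the gap comes from the model weights themselves or from allocator behaviour is to snapshot CUDA memory at a few points in each run. This is a generic sketch using PyTorch’s built-in counters, not something from my actual experiments:

import torch

def report_cuda_memory(tag: str) -> None:
    # memory_allocated() counts live tensors; memory_reserved() counts what the caching
    # allocator has claimed from the driver, which is what actually limits batch size
    torch.cuda.synchronize()
    allocated = torch.cuda.memory_allocated() / 2**20
    reserved = torch.cuda.memory_reserved() / 2**20
    print(f"{tag}: {allocated:.1f} MiB allocated, {reserved:.1f} MiB reserved")

Calling this right after moving the model to the GPU and again after the first optimizer step in both setups should show where the two runs start to diverge.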

P.S.

All training experiments (including all GPU communication) were based on PyTorch Lightning, which might be a relevant detail.


I haven’t run into the same issue myself, but I have a guess.
What I’d suspect in a case like this is torch and the accelerate library.
It’s possible that accelerate is handling RAM and VRAM allocation efficiently, and that torch is using meta tensors under the hood.
A worse possibility would be a bug in from_pretrained() or in the config-based initialization path in the transformers library…
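
If the meta-tensor path is involved, here is a hedged sketch of the mechanism I mean, with BertModel and bert-base-uncased as placeholders: from_pretrained() can build the module skeleton on the meta device (via accelerate) and only materialize real tensors when the checkpoint weights are loaded, while ModelClass(config) always runs the full random initialization. Whether this explains a VRAM difference is only a guess, since it mainly affects CPU RAM during loading.

from accelerate import init_empty_weights
from transformers import BertConfig, BertModel

# from_pretrained() with low_cpu_mem_usage=True takes the accelerate/meta-tensor path
model_loaded = BertModel.from_pretrained("bert-base-uncased", low_cpu_mem_usage=True)

# Roughly what that path does under the hood: build the module on the meta device,
# so no real storage is allocated until weights are loaded into it
config = BertConfig.from_pretrained("bert-base-uncased")
with init_empty_weights():
    model_empty = BertModel(config)  # parameters live on the meta device

print(next(model_empty.parameters()).device)  # meta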
