I’ve recently run into a strange difference in GPU memory footprint during training.
I launched two training runs with identical hyperparameters and random seeds. The only difference was how the Hugging Face model was instantiated:
Option 1:
model = ModelClass.from_pretrained(path_to_hub)
Option 2:
config = Config.from_pretrained(path_to_hub)
model = ModelClass(config)
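For concreteness, here is a minimal sketch of the two setups; BertModel, BertConfig and bert-base-uncased are just stand-ins for the actual model class and checkpoint:

from transformers import BertConfig, BertModel

checkpoint = "bert-base-uncased"  # placeholder for the actual checkpoint

# Option 1: build the architecture and load the pretrained weights from the checkpoint
model_pretrained = BertModel.from_pretrained(checkpoint)

# Option 2: same architecture from the same config, but weights stay randomly initialized
config = BertConfig.from_pretrained(checkpoint)
model_from_config = BertModel(config)

Both options produce the same architecture; Option 1 additionally loads the checkpoint weights, while Option 2 leaves them randomly initialized.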
Surprisingly, the model initialized with Option 1 (from_pretrained) could be trained with a larger batch size than the one initialized with Option 2 (from config): with Option 2, the same larger batch triggered a “CUDA out of memory” error.
I compared the model configurations and the dtype of the weights, but I couldn’t find any differences between the two options.
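A check along these lines (a sketch; model_pretrained and model_from_config refer to the placeholder snippet above) is what I mean by comparing the weight dtypes:

import torch

def param_summary(model: torch.nn.Module):
    # Set of parameter dtypes and total parameter memory in MiB
    dtypes = {p.dtype for p in model.parameters()}
    total_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    return dtypes, total_bytes / 2**20

print(param_summary(model_pretrained))
print(param_summary(model_from_config))  # same dtypes and total size in both cases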
Why am I observing this effect? Is this expected behavior? Could it be hardware-specific or related to CUDA memory allocation? Has anyone else experienced this? I haven’t found any relevant information online.
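For reference, standard torch.cuda allocator statistics can be logged around a single training step like this (a diagnostic sketch, not code from my actual runs):

import torch

torch.cuda.reset_peak_memory_stats()

# ... one forward/backward/optimizer step here ...

print(f"allocated: {torch.cuda.memory_allocated() / 2**20:.1f} MiB")
print(f"reserved:  {torch.cuda.memory_reserved() / 2**20:.1f} MiB")
print(f"peak:      {torch.cuda.max_memory_allocated() / 2**20:.1f} MiB")

# Detailed breakdown from the caching allocator
print(torch.cuda.memory_summary())

I’d expect these numbers to show whether the gap comes from the weights themselves or from how memory gets allocated during the step.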
P.S.
All training experiments (including all GPU communication) were run with PyTorch Lightning, which might be a relevant detail.