Why does GPU memory fill up in the first few iterations when training with the transformers Trainer, then drop to a much lower level as training progresses?

I'm training the vision module of the Qwen2.5-VL-3B model and freezing the other modules. The vision module is composed of transformer blocks, and I train it on a single 80GB A100. The maximum number of pixels per image is set to 1,048,576, which corresponds to about 5349 vision tokens fed into the vision module (the patch size is 14x14).
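For reference, the setup looks roughly like this (a simplified sketch, not my exact script; the checkpoint name, the `max_pixels` argument, and the `visual` parameter prefix reflect my understanding of the Qwen2.5-VL implementation and may differ):

```python
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# Load the model in bf16 (checkpoint name assumed).
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", torch_dtype=torch.bfloat16
)

# Cap image resolution so a single image produces at most
# 1,048,576 / (14 * 14) ≈ 5349 vision tokens.
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", max_pixels=1048576
)

# Train only the vision tower and freeze everything else
# (the "visual" prefix for vision-tower parameters is an assumption).
for name, param in model.named_parameters():
    param.requires_grad = "visual" in name
```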
When training starts, GPU memory fills up almost completely within the first few iterations; the reported usage is 79000MiB out of 81920MiB. After a few minutes, usage drops quickly to about 35000MiB and stabilizes there.
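To see whether the drop reflects what PyTorch itself allocates rather than just the total reported for the process, I could log memory from inside the Trainer with a callback like this (a rough sketch; the callback class and the logging choice are just illustrative):

```python
import torch
from transformers import TrainerCallback

class MemoryLoggerCallback(TrainerCallback):
    """Log allocated vs. reserved CUDA memory at the end of each training step."""

    def on_step_end(self, args, state, control, **kwargs):
        allocated = torch.cuda.memory_allocated() / 1024**2  # MiB actually used by live tensors
        reserved = torch.cuda.memory_reserved() / 1024**2    # MiB held by the caching allocator
        print(f"step {state.global_step}: allocated={allocated:.0f} MiB, reserved={reserved:.0f} MiB")

# Usage: trainer.add_callback(MemoryLoggerCallback())
```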
I want to know what causes this.
