Hello, Hugging Face community,
I’m encountering a concerning issue while training a model with the Transformers Trainer class: GPU VRAM usage steadily increases over time and is never released.
Training starts with 23GB allocated across 5 GPUs, but as training progresses the VRAM usage keeps climbing. After about 2-3 hours (roughly 20,000 iterations at a batch size of 32), usage maxes out at 80GB and the GPUs hit an Out-Of-Memory (OOM) error.
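In case it is useful, this is roughly how the PyTorch-side allocation per visible GPU can be checked (just torch.cuda.memory_allocated / memory_reserved, nothing Trainer-specific):

import torch

# Report how much memory the CUDA caching allocator holds per GPU (in GiB).
for i in range(torch.cuda.device_count()):
    allocated = torch.cuda.memory_allocated(i) / 1024**3
    reserved = torch.cuda.memory_reserved(i) / 1024**3
    print(f"cuda:{i} allocated={allocated:.2f} GiB reserved={reserved:.2f} GiB")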
I’m not using PEFT. The model is LLaMA 2. For training, gradients are enabled only for the image-embedding projector, the embedding rows of the added special tokens, and the LM head.
Below is the relevant code snippet:
In train.py:
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path="/PATH/TO/CKPT",
    torch_dtype=torch_dtype,  # bfloat16
    low_cpu_mem_usage=True,
    device_map=device_map,
)
model.resize_token_embeddings(len(tokenizer))
def freeze_partial_embedding_hook(grad):
    # Zero gradients for the original vocabulary rows so that only the
    # newly added token embeddings are updated.
    grad[:32000] = 0
    return grad

def freeze_partial_lmhead_hook(grad):
    # Same for the LM head: only the rows of the added tokens receive gradients.
    grad[:32000] = 0
    return grad

def set_requires_grad(model):
    for name, param in model.named_parameters():
        if "lm_head" in name or "embed_tokens" in name or "projector" in name:
            param.requires_grad = True
            if "embed_tokens" in name:
                param.register_hook(freeze_partial_embedding_hook)
            if "lm_head" in name:
                param.register_hook(freeze_partial_lmhead_hook)
        else:
            param.requires_grad = False

set_requires_grad(model)
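For reference, a quick sanity check like the following (a throwaway helper, not part of the training code) can list which parameters actually end up trainable after set_requires_grad:

def print_trainable_parameters(model):
    # List trainable parameters and report their share of the total.
    trainable, total = 0, 0
    for name, param in model.named_parameters():
        total += param.numel()
        if param.requires_grad:
            trainable += param.numel()
            print(f"trainable: {name} shape={tuple(param.shape)}")
    print(f"{trainable:,} / {total:,} parameters trainable ({100 * trainable / total:.2f}%)")

print_trainable_parameters(model)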
And this is how the training run is launched:
CUDA_VISIBLE_DEVICES="0,1,2,3,4" torchrun --nnodes 1 --nproc_per_node 5 train_pt.py \
--output_dir ../saved_models/pt-book-ST5 \
--logging_strategy steps \
--logging_steps 10 \
--logging_first_step True \
--save_strategy steps \
--save_total_limit 3 \
--save_steps 5000 \
--per_device_train_batch_size 32 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 1 \
--data_name_or_path bookcorpus \
--preprocessing_num_workers 16 \
--warmup_ratio 0.03 \
--weight_decay 0 \
--lr_scheduler_type cosine \
--ddp_find_unused_parameters False \
--do_eval False \
--evaluation_strategy no \
--overwrite_output_dir True \
--remove_unused_columns False \
--gradient_checkpointing True
Has anyone experienced a similar issue with VRAM usage increasing over time during training? I’m curious about any insights or solutions to prevent the GPUs from running into OOM, especially in long training sessions. I’ve checked for common sources of memory leaks in PyTorch, but the gradual increase in VRAM seems tied to the training process itself.
Additionally, if anyone knows how to effectively trigger garbage collection while using the Trainer, I would greatly appreciate any advice or insights on that matter!
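Something like the following custom TrainerCallback is what I had in mind (a rough, untested sketch; the class name and interval are placeholders of mine), though I understand torch.cuda.empty_cache() only returns cached blocks to the driver and would not help if tensors are still being referenced somewhere:

import gc
import torch
from transformers import TrainerCallback

class GCCallback(TrainerCallback):
    # Periodically run Python garbage collection and release cached CUDA blocks.
    def __init__(self, gc_every_steps=500):
        self.gc_every_steps = gc_every_steps

    def on_step_end(self, args, state, control, **kwargs):
        if state.global_step > 0 and state.global_step % self.gc_every_steps == 0:
            gc.collect()
            torch.cuda.empty_cache()

# Passed to the Trainer via: Trainer(..., callbacks=[GCCallback(gc_every_steps=500)])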
Thank you in advance for your help!