Increasing VRAM Usage with Transformers Trainer Leads to OOM on GPUs

Hello, Hugging Face community,

I’m encountering a concerning issue while training a model with the Transformers Trainer class. The problem is GPU VRAM usage: it steadily increases over time and never comes back down once it has grown.
Training starts with about 23GB allocated across 5 GPUs, but as training progresses the VRAM usage keeps climbing. After roughly 2-3 hours (about 20,000 iterations at a batch size of 32), usage maxes out at 80GB and the GPUs hit an Out-Of-Memory (OOM) error.

I’m not using PEFT. The model in question is LLaMA 2. For training, gradients are enabled only for the image-embedding projector, the special-token-related embedding layers, and the LM head.

Below is the relevant code snippet:

In train.py:

from transformers import AutoModelForCausalLM

# torch_dtype, device_map, and tokenizer are defined earlier in the script
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path="/PATH/TO/CKPT",
    torch_dtype=torch_dtype,  # bfloat16
    low_cpu_mem_usage=True,
    device_map=device_map,
)
model.resize_token_embeddings(len(tokenizer))


def freeze_partial_embedding_hook(grad):
    # Zero the gradient for the original 32,000 vocabulary rows so that only
    # the newly added special-token embeddings get updated.
    grad[:32000] = 0
    return grad


def freeze_partial_lmhead_hook(grad):
    # Same idea for the LM head: only the rows for the new tokens are trained.
    grad[:32000] = 0
    return grad


def set_requires_grad(model):
    # Train only the projector, token embeddings, and LM head; freeze the rest.
    for name, param in model.named_parameters():
        if "lm_head" in name or "embed_tokens" in name or "projector" in name:
            param.requires_grad = True
            if "embed_tokens" in name:
                param.register_hook(freeze_partial_embedding_hook)
            if "lm_head" in name:
                param.register_hook(freeze_partial_lmhead_hook)
        else:
            param.requires_grad = False

set_requires_grad(model)
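
For completeness, a quick sanity check (just a sketch, nothing Trainer-specific) to confirm that only the projector, embedding, and LM-head parameters end up trainable:

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable params: {trainable:,} / {total:,}")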

And this is how I launch the training:

CUDA_VISIBLE_DEVICES="0,1,2,3,4" torchrun --nnodes 1 --nproc_per_node 5 train_pt.py \
    --output_dir ../saved_models/pt-book-ST5 \
    --logging_strategy steps \
    --logging_steps 10 \
    --logging_first_step True \
    --save_strategy steps \
    --save_total_limit 3 \
    --save_steps 5000 \
    --per_device_train_batch_size 32 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --data_name_or_path bookcorpus \
    --preprocessing_num_workers 16 \
    --warmup_ratio 0.03 \
    --weight_decay 0 \
    --lr_scheduler_type cosine \
    --ddp_find_unused_parameters False \
    --do_eval False \
    --evaluation_strategy no \
    --overwrite_output_dir True \
    --remove_unused_columns False \
    --gradient_checkpointing True

Has anyone experienced a similar issue with VRAM usage increasing over time during training? I’m curious about any insights or solutions to prevent the GPUs from running into OOM, especially in long training sessions. I’ve checked the common sources of memory leaks in PyTorch, but the gradual increase in VRAM seems tied to the training process itself.
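
For reference, the kind of per-process check I mean is roughly the following (a minimal sketch; log_cuda_memory is just an illustrative helper name):

import torch

def log_cuda_memory(step):
    # All values are per process, i.e. per local GPU under torchrun.
    allocated = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    peak = torch.cuda.max_memory_allocated() / 1024**3
    print(f"step {step}: allocated={allocated:.2f} GB | reserved={reserved:.2f} GB | peak={peak:.2f} GB")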

Additionally, if anyone knows how to use the garbage collector effectively with the Trainer, I would greatly appreciate any advice or insights on that as well!
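
To make the question concrete, the kind of thing I have in mind is a callback along these lines (a rough sketch; FreeMemoryCallback is just an illustrative name, not part of the library):

import gc

import torch
from transformers import TrainerCallback

class FreeMemoryCallback(TrainerCallback):
    # Sketch: periodically run the Python garbage collector and release cached
    # CUDA blocks. This only frees *cached* memory, so it won't fix a genuine
    # leak, but it helps separate allocator caching from real growth.
    def __init__(self, every_n_steps=500):
        self.every_n_steps = every_n_steps

    def on_step_end(self, args, state, control, **kwargs):
        if state.global_step % self.every_n_steps == 0:
            gc.collect()
            torch.cuda.empty_cache()

# passed in via Trainer(..., callbacks=[FreeMemoryCallback()])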

Thank you in advance for your help!

My first thought was that your evaluation strategy might be adding to the VRAM usage, but I see in your configuration that you’ve turned those options off. However, per_device_eval_batch_size is set to 1, so it may be worth double-checking that no evaluation is actually running. In my experience, evaluation runs on a single GPU rather than being parallelized across multiple GPUs, and having an evaluation strategy enabled is what caused OOM for me.
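
One quick way to double-check, assuming you have the Trainer instance at hand (just a sketch):

# Both should confirm evaluation is off.
print(trainer.args.do_eval)              # expected: False
print(trainer.args.evaluation_strategy)  # expected: IntervalStrategy.NO (i.e. "no")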

In my experience, if the first batch completes without an OOM error, you’re usually fine for all subsequent batches.

Wish I was more help - but good luck!

Thank you for your response.

I’ve experimented with it, and no evaluation step runs during training, so I suspect something else is going wrong.