CUDA out of memory when using Trainer with compute_metrics

Recently I have been trying to fine-tune Bart-base with Transformers (version 4.1.1). Fine-tuning runs smoothly with compute_metrics=None in the Trainer. However, when I implement a metrics function and pass it to the Trainer, I get a CUDA out of memory error during the evaluation stage.

I just wanted to try this feature out, so my implementation is straightforward:

def compute_metrics(pred):
    preds = pred.predictions
    labels = pred.label_ids
    print(preds.shape, labels.shape)
    return {
        'loss': 1
    }

Since training runs normally when compute_metrics=None, I don't think batch size is the problem. Still, I tried smaller batch sizes, and even with a batch size of 1 the situation is the same. I even tried tinier-bart, but I still got the same error.

One thing caught my attention :thinking:. With a tiny batch size, the memory does not fill up at once; instead, the occupancy keeps growing until the GPU can't hold any more data. In other words, the processed data is not released in time. Is there any magic operation to solve this problem?

6 Likes

When computing metrics inside the Trainer, your predictions are all gathered together on the device (GPU/TPU) and only passed back to the CPU at the end (because that operation can be slow). If your dataset is large (or your model outputs large predictions) you can use eval_accumulation_steps to set a number of steps after which your predictions are sent back to the CPU (slower but uses less device memory). This should avoid your OOM.
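For example (a minimal sketch; the value 20 is arbitrary and only trades evaluation speed against device memory):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="exp/bart/results",
    per_device_eval_batch_size=32,
    eval_accumulation_steps=20,  # move accumulated predictions to the CPU every 20 eval steps
)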

6 Likes

I tried eval_accumulation_steps, but another problem occurred.

Here is part of my fine-tuning code:

args = TrainingArguments(
    output_dir="exp/bart/results",
    do_train=True,
    do_eval=True,
    evaluation_strategy="steps",
    eval_steps=1000,
    logging_dir="exp/bart/logs",
    num_train_epochs=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    gradient_accumulation_steps=2,
    eval_accumulation_steps=1,
)

trainer = Trainer(
    model=bart,
    args=args,
    data_collator=collate_fn,
    train_dataset=train_set,
    eval_dataset=eval_set,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

Also, the evaluation set has 22,161 lines. When I set eval_accumulation_steps=1, I get:

MemoryError: Unable to allocate 149. GiB for an array with shape (22162, 36, 50265) and data type float32
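That matches holding the full float32 logits over the whole vocabulary in a single array:

22162 * 36 * 50265 * 4 / 1024**3  # shape from the error, times 4 bytes per float32 ≈ 149.4 GiB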

It looks like the Trainer allocates all of that space at once. Do I need to set other parameters so that the Trainer only allocates the space it actually needs at each step?

This error means you are trying to get predictions that just don't fit in RAM, so there is nothing the Trainer can do to help. I don't know which BART model you're using, but it looks like you have huge logits, so you should split your evaluation dataset into small parts or use a custom evaluation loop.
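For instance, a rough sketch of evaluating in shards (assuming eval_set is a datasets.Dataset and that averaging the per-shard numbers is acceptable for your metrics):

num_shards = 20
shard_metrics = []
for i in range(num_shards):
    shard = eval_set.shard(num_shards=num_shards, index=i)  # take one slice of the eval set
    shard_metrics.append(trainer.evaluate(eval_dataset=shard))

# simple average over shards (only approximate if the shards differ slightly in size)
avg_metrics = {k: sum(m[k] for m in shard_metrics) / num_shards for k in shard_metrics[0]}
print(avg_metrics)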

1 Like

Thank you for this answer! It helped me a lot!

Hi, I'm still struggling with this issue. I'm trying to fine-tune a BART model, and while I can get it to train, I always run out of memory during the evaluation phase. This does not happen when I don't use compute_metrics, so I think there's an issue there: without compute_metrics I can run batch sizes of up to 16, but with compute_metrics I can't even use a batch size of 1, even with eval accumulation.

Could you please explain why compute_metrics is so much heavier, given that I can run training and evaluation without issues otherwise? In your answer above you mentioned that the Trainer holds all predictions on the GPU, but why is this done for metrics calculation?

I have used Fairseq for seq2seq tasks with similarly sized models before this and never ran into this issue, so I was also wondering whether it does metrics computation differently.

2 Likes

I also see this when I include my own compute_metrics implementation: GPU memory usage gradually increases over time.

Could it be that the data structures (tensors, I assume) used in our own implementations are filling up GPU memory with each evaluation and overloading the device, while the default implementation somehow makes better use of garbage collection? Should we explicitly free variables that are no longer needed? It seems variables are not released after the compute_metrics() function is done. @sgugger

6 Likes

I was getting this same error, so I am now using eval_accumulation_steps=16 with per_device_eval_batch_size=1, but now my Google Colab session crashes because it uses up all the available RAM. Any more help would be appreciated. I am on Colab Pro with a 15 GB GPU, a 2.12 GB Pegasus model, and the DialogSum dataset; my train batch size is also 2.

Were you able to solve your problem?

I think there must be a leak somewhere; it doesn't seem plausible that training works while evaluation does not.

The problem might be that everything, or at least some large variables, gets stored before the custom compute_metrics runs. Is this something you are planning to investigate?

1 Like

Yes, and I have found that when compute_metrics is set, the Trainer keeps all the logits on CUDA before moving them to CPU memory, just as sgugger said.

Sorry for forgetting to mark this question as SOLVED :slight_smile:

2 Likes

I don't think that solves the issue; it only moves it to RAM instead of the GPU. The real solution is the preprocess_logits_for_metrics function (here).

I leave here my specific solution (both functions):

def compute_metrics(pred):
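    # assumes `tokenizer` and a `rouge` metric object (e.g. evaluate.load("rouge")) are already defined elsewhere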

    labels_ids = pred.label_ids
    pred_ids = pred.predictions[0]

    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

    rouge_output = rouge.compute(
        predictions=pred_str,
        references=label_str,
        rouge_types=["rouge1", "rouge2", "rougeL", "rougeLsum"],
    )

    return {
        "R1": round(rouge_output["rouge1"], 4),
        "R2": round(rouge_output["rouge2"], 4),
        "RL": round(rouge_output["rougeL"], 4),
        "RLsum": round(rouge_output["rougeLsum"], 4),
    }

def preprocess_logits_for_metrics(logits, labels):
    """
    Original Trainer may have a memory leak. 
    This is a workaround to avoid storing too many tensors that are not needed.
    """
    pred_ids = torch.argmax(logits[0], dim=-1)
    return pred_ids, labels
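Both functions are passed to the Trainer (preprocess_logits_for_metrics is a regular Trainer argument). Roughly, reusing the names from the earlier example in this thread:

trainer = Trainer(
    model=bart,
    args=args,
    data_collator=collate_fn,
    train_dataset=train_set,
    eval_dataset=eval_set,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    preprocess_logits_for_metrics=preprocess_logits_for_metrics,
)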

BTW, proceeding this way, you may not need eval_accumulation_steps=1 (which slows down evaluation significantly).

12 Likes

Thank you very much! This should be the optimal solution.

1 Like

What is the reason for only using the first element of logits and predictions?

I figured it out, maybe. Whatever is returned by preprocess_logits_for_metrics(logits, labels) becomes available as pred.predictions inside compute_metrics(pred). By returning (pred_ids, labels), pred.predictions becomes a tuple whose second element is the labels. But the labels are available in pred.label_ids regardless, so only the predictions need to be returned by preprocess_logits_for_metrics, and in that case we don't need to index predictions at 0. As for only performing argmax on logits[0], I still don't understand: wouldn't that just take the first element in the batch and ignore the rest?
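In other words, something like this should also work (an untested sketch with a toy token-accuracy metric, just to show the wiring):

import torch

def preprocess_logits_for_metrics(logits, labels):
    # return only the predicted ids, so the full logits are never accumulated
    return torch.argmax(logits, dim=-1)

def compute_metrics(pred):
    # a single tensor was returned above, so pred.predictions is the id array itself
    # (the Trainer hands it over as a NumPy array)
    pred_ids = pred.predictions
    labels = pred.label_ids
    mask = labels != -100  # ignore padded label positions
    return {"token_accuracy": float((pred_ids[mask] == labels[mask]).mean())}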

Let me know if I’m wrong :slight_smile:

I think your understanding is correct on both counts. logits[0] would only take the first sequence in the batch, assuming all other training args are left at their defaults. I modified the snippet above to remove the subscript.

Sorry for the late reply. I actually used logits[0] because the version I was using passed both the logits and the labels in logits (or maybe I was misinterpreting them). I agree with you: the correct snippet should use logits instead of logits[0] in the argmax.

pred_ids = torch.argmax(logits, dim=-1)

3 Likes

Yes, I solved the "CUDA out of memory when using Trainer with compute_metrics" issue using your solution. Thank you so much. :smiling_face_with_three_hearts:

3 Likes

In my case, I'm working on SpeechT5 for ASR, and the logits are a tuple of two items: the first is the decoder output (the logits I need) and the second is the encoder's last hidden state, so I work with logits[0].

2 Likes

Thank you! Fixed my problem.