CUDA out of memory when using Trainer with compute_metrics

KhaiKit · October 5, 2023, 3:49am

Hi

I have a problem related to what you mentioned here.

I am currently fine-tuning the NLLB translation model on a GPU, and I would like to compute metrics so that I can follow the progress of training as it runs. The problem is that when I increase my dataset to approximately 50K examples (followed by a 0.2 train-test split), the trainer completes 1 epoch of training within 9 minutes but only finishes the evaluation for that epoch 20 minutes later.

I used simple print statements in my compute_metrics function and realised that the whole function ran in less than a minute, so I’m not so sure what went wrong. Is there something wrong with my compute_metrics function?

I understand that tokenization happens on the CPU? So I was wondering whether the problem is that I am training on the GPU and then evaluating on the CPU. 20 minutes of evaluation doesn't seem like a big problem, but when I increased my dataset to 190K examples, the trainer completed 1 epoch in 30 minutes and had still not finished the evaluation 70 minutes later.

Here is some of the code I use:

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    logger.warning(result)
    return result

training_args = Seq2SeqTrainingArguments(
    output_dir="my_awesome_opus_books_model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=2,
    predict_with_generate=True,
    fp16=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_books["train"],
    eval_dataset=tokenized_books["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
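
To go one step beyond print statements, per-stage timers can show whether the time is spent inside compute_metrics or before it is ever called. With predict_with_generate=True the evaluation loop calls the model's generate method for every batch, so time that does not show up inside compute_metrics is plausibly spent there. The following is a minimal sketch using only the standard library; the `timed` helper and the stage names are hypothetical, and the two toy stages stand in for decoding and metric computation.

```python
import time
from contextlib import contextmanager

# Accumulated wall-clock seconds per named stage.
timings = {}

@contextmanager
def timed(stage):
    """Add the wall-clock time of the enclosed block to timings[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = timings.get(stage, 0.0) + time.perf_counter() - start

# Toy stand-ins for the stages inside a compute_metrics function.
with timed("decode"):
    decoded = [str(i) for i in range(1000)]
with timed("metric"):
    total_chars = sum(len(s) for s in decoded)

for stage, seconds in timings.items():
    print(f"{stage}: {seconds:.6f}s")
```

Inside a real compute_metrics, the same `with timed(...)` blocks would wrap the batch_decode calls and the metric.compute call.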

SimpleJerry · June 16, 2024, 6:54am

Thank you so much. Your answer solved my problem.

sc3051 · August 2, 2024, 3:37pm

morenolq:

def preprocess_logits_for_metrics(logits, labels):
    """
    Original Trainer may have a memory leak. 
    This is a workaround to avoid storing too many tensors that are not needed.
    """
    pred_ids = torch.argmax(logits[0], dim=-1)
    return pred_ids, labels

This solves the exact problem for me too. Why hasn't anyone from Hugging Face integrated this directly into the Trainer? It took me so long to find this solution after trying so many other things…
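
For a rough sense of why this workaround helps: the Trainer gathers the per-batch outputs of the whole evaluation set before calling compute_metrics, and preprocess_logits_for_metrics is applied to each batch before it is stored. Keeping only the argmax IDs removes the vocabulary dimension from what gets stored. The arithmetic below is a sketch under assumed sizes (10,000 evaluation examples, 128 tokens each, a 32,128-entry vocabulary as in T5-small, float32 logits, int64 IDs); `tensor_bytes` is a hypothetical helper.

```python
def tensor_bytes(shape, bytes_per_element):
    """Size in bytes of a dense tensor with the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element

# Assumed, illustrative sizes (not taken from any post in this thread).
num_eval_examples = 10_000
seq_len = 128
vocab_size = 32_128  # T5-small vocabulary size

# Full logits: one float32 score per vocabulary entry per token.
logits_bytes = tensor_bytes((num_eval_examples, seq_len, vocab_size), 4)
# Argmax IDs: one int64 per token.
pred_ids_bytes = tensor_bytes((num_eval_examples, seq_len), 8)

print(f"logits:   {logits_bytes / 1e9:.1f} GB")    # 164.5 GB
print(f"pred ids: {pred_ids_bytes / 1e6:.1f} MB")  # 10.2 MB
print(f"ratio:    {logits_bytes // pred_ids_bytes}x")  # 16064x
```

Under these assumptions the stored predictions shrink by a factor of about 16,000, which is why moving the same logits to CPU RAM only relocates the problem.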

thistlillo · December 13, 2024, 9:45am

I do not understand what you said.

My Trainer is the following

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model, padding="longest", label_pad_token_id=-100)
evaluator = QGenMetrics(tokenizer)
trainer = Seq2SeqTrainer(
    model=model,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    data_collator=data_collator,
    # compute_metrics=evaluator.compute_metrics_validation,
    args=Seq2SeqTrainingArguments(
        output_dir="./_remove",
        gradient_accumulation_steps=1,
        per_device_train_batch_size=32,
        per_device_eval_batch_size=32,
        num_train_epochs=1,
        seed=1,
        data_seed=2,
        predict_with_generate=True,
        eval_strategy="epoch",
        report_to="none"
    ) #< training args
) #< trainer

The model is a T5-small, which does not take up much memory.

The trainer above occupies all the GPU RAM it finds; in my case it fills 80GB. When training ends, the memory is not deallocated.

I moved this project from pure PyTorch to Hugging Face classes, and it is very hard to understand everything that is happening on the GPU right now.
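
One thing that may help narrow this down: PyTorch's caching allocator keeps freed blocks reserved for reuse, so tools such as nvidia-smi can report memory as occupied even when few tensors are still alive. Comparing allocated and reserved memory separates the two cases. This is a minimal sketch, assuming PyTorch is installed; the helper names are hypothetical, and it does not by itself explain why 80GB gets filled.

```python
import gc

import torch

def gpu_memory_report():
    """Return allocated and reserved CUDA memory in GiB (zeros without a GPU)."""
    if not torch.cuda.is_available():
        return {"allocated_gib": 0.0, "reserved_gib": 0.0}
    gib = 1024 ** 3
    return {
        # Memory held by live tensors.
        "allocated_gib": torch.cuda.memory_allocated() / gib,
        # Memory held by the caching allocator, including free cached blocks.
        "reserved_gib": torch.cuda.memory_reserved() / gib,
    }

def release_cached_memory():
    """Collect garbage, return unused cached blocks to the driver, and report."""
    # Objects that are still referenced (trainer, model, optimizer) are not
    # freed here; drop those references first, e.g. with `del trainer`.
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    return gpu_memory_report()

report = release_cached_memory()
print(report)
```

If reserved memory is much larger than allocated memory, most of what looks occupied is cache rather than live tensors.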

Sin2pi · December 13, 2024, 11:49pm

This is what worked for me when I was adapting my ASR model architecture to the Hugging Face Trainer API:

def compute_metrics(pred):
    pred_logits = pred.predictions
    label_ids = pred.label_ids

    if isinstance(pred_logits, tuple):
        pred_ids = pred_logits[0]
    else:
        pred_ids = pred_logits
    if pred_ids.ndim == 3:
        pred_ids = np.argmax(pred_ids, axis=-1)

    label_ids[label_ids == -100] = tokenizer.pad_token_id
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)
    cer = 100 * metric.compute(predictions=pred_str, references=label_str)
    return {"cer": cer}

This lets the code handle both tuple and non-tuple formats of pred_logits.
For 3-dimensional logits, it converts them into a 2-dimensional array of predicted class IDs by taking the argmax over the last (class) dimension.
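
The shape handling described above can be checked on toy data. The sketch below assumes NumPy and uses random logits of shape (batch=2, sequence=5, classes=11); `to_pred_ids` is a hypothetical helper that isolates the tuple and argmax steps from the function above.

```python
import numpy as np

def to_pred_ids(predictions):
    """Unwrap a tuple if needed and reduce 3-D logits to 2-D predicted IDs."""
    if isinstance(predictions, tuple):
        predictions = predictions[0]
    if predictions.ndim == 3:
        # (batch, sequence, classes) -> (batch, sequence)
        predictions = np.argmax(predictions, axis=-1)
    return predictions

rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 5, 11))

ids_from_logits = to_pred_ids(logits)
ids_from_tuple = to_pred_ids((logits, None))
already_ids = to_pred_ids(np.zeros((2, 5), dtype=np.int64))

print(ids_from_logits.shape)  # (2, 5)
print(ids_from_tuple.shape)   # (2, 5)
print(already_ids.shape)      # (2, 5), passed through unchanged
```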

vishakha-lall · June 25, 2025, 4:57am

For anyone stuck with this problem in the case of Vision Transformers, here’s the corresponding function.

def preprocess_logits_for_metrics_fn(logits_tuple, labels):
    # Unpack logits tuple
    cls_logits = logits_tuple[1]
    box_preds = logits_tuple[2]

    # Detach and move to CPU (important for memory and multiprocessing)
    cls_logits = cls_logits.detach().cpu()
    box_preds = box_preds.detach().cpu()

    return (cls_logits, box_preds), labels

It is passed to the Trainer as follows:

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["val"],
    processing_class=image_processor,
    data_collator=collate_fn,
    compute_metrics=eval_compute_metrics_fn,
    preprocess_logits_for_metrics=preprocess_logits_for_metrics_fn,
)

tahmid1234 · September 19, 2025, 4:43am

morenolq:

I don’t think it solves the issue; it only moves the problem from GPU memory to RAM. The real solution is the preprocess_logits_for_metrics function (here).

Here is my specific solution (both functions):

def compute_metrics(pred):

    labels_ids = pred.label_ids
    pred_ids = pred.predictions[0]

    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

    rouge_output = rouge.compute(
        predictions=pred_str,
        references=label_str,
        rouge_types=["rouge1", "rouge2", "rougeL", "rougeLsum"],
    )

    return {
        "R1": round(rouge_output["rouge1"], 4),
        "R2": round(rouge_output["rouge2"], 4),
        "RL": round(rouge_output["rougeL"], 4),
        "RLsum": round(rouge_output["rougeLsum"], 4),
    }

def preprocess_logits_for_metrics(logits, labels):
    """
    Original Trainer may have a memory leak. 
    This is a workaround to avoid storing too many tensors that are not needed.
    """
    pred_ids = torch.argmax(logits[0], dim=-1)
    return pred_ids, labels

BTW, if you proceed this way, you may not need to use eval_accumulation_steps=1 (which slows down the evaluation significantly).

But will this work if compute_metrics calculates AUPR or AUROC? If I use sigmoid, will I stop getting this error?
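
On the AUPR/AUROC question: both metrics rank examples by a continuous score, so reducing logits to argmax IDs would discard the information they need. Sigmoid is monotonic, so a ranking-based AUROC is the same on raw logits and on sigmoid probabilities. Whether the out-of-memory error goes away depends on how large the scores returned by preprocess_logits_for_metrics are, not on the sigmoid itself. The sketch below is pure Python with made-up logits and labels for a binary task; the pairwise `auroc` helper is a hypothetical illustration, not a library function.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def auroc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up positive-class logits and binary labels.
logits = [2.0, -1.0, 0.5, -3.0, 0.2]
labels = [1, 0, 1, 0, 0]

probs = [sigmoid(z) for z in logits]
hard_preds = [1 if z > 0 else 0 for z in logits]  # what a hard decision keeps

auc_logits = auroc(logits, labels)
auc_probs = auroc(probs, labels)
auc_hard = auroc(hard_preds, labels)

print(auc_logits, auc_probs)  # identical: sigmoid preserves the ranking
print(auc_hard)               # lower: hard decisions lose the ranking
```

For a classification head the scores are typically one value per class per example, which is far smaller than token-level logits over a vocabulary, so returning them unchanged (or only the positive-class column) may already be enough.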
