CUDA out of memory when using Trainer with compute_metrics

Hi

I have a problem related to what you mentioned here.

I am currently finetuning the NLLB translation model on a GPU, and I would like to compute metrics to see the progress of the training process as it trains. The problem I face is that when I increase my dataset to approximately 50K examples (with a 0.2 train-test split), the trainer completes one epoch within 9 minutes but only finishes the evaluation for that epoch about 20 minutes later.
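
For context, here is a minimal sketch of the setup I am describing; the data file and the NLLB checkpoint name below are placeholders, my actual ones differ:

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# placeholder file / checkpoint names
raw_ds = load_dataset("json", data_files="pairs.json")["train"]  # ~50K sentence pairs
split_ds = raw_ds.train_test_split(test_size=0.2, seed=42)       # the 0.2 train-test split

checkpoint = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)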

I used simple print statements in my compute_metrics function and realised that the whole function ran in less than a minute, so I am not sure what is going wrong. Is there something wrong with my compute_metrics function?
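
In case it is useful, this is the kind of timing I mean, just a thin wrapper around the compute_metrics function shown below, using time.perf_counter rather than bare print statements:

import time

def timed_compute_metrics(eval_preds):
    # wraps compute_metrics purely to measure how long it runs
    start = time.perf_counter()
    result = compute_metrics(eval_preds)
    print(f"compute_metrics took {time.perf_counter() - start:.1f}s")
    return result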

I understand that tokenization happens on the CPU, so I was wondering if the problem I face is because I am training on the GPU and then evaluating on the CPU. 20 minutes of evaluation doesn't seem like a big problem, but when I increased my dataset to 190K examples, the trainer completed one epoch in 30 minutes yet still had not finished the evaluation 70 minutes later.

Here is some of the code I use:

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    logger.warning(result)
    return result

training_args = Seq2SeqTrainingArguments(
    output_dir="my_awesome_opus_books_model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=2,
    predict_with_generate=True,
    fp16=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_books["train"],
    eval_dataset=tokenized_books["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
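
One thing I have been looking at, in case the accumulation of predictions on the GPU is part of the problem, is the eval_accumulation_steps argument, which moves the accumulated outputs to the CPU every N evaluation steps instead of keeping them all on the GPU until the end of the loop. A minimal sketch of the arguments above with it added; the value 16 is just an illustration:

training_args = Seq2SeqTrainingArguments(
    output_dir="my_awesome_opus_books_model",
    evaluation_strategy="epoch",
    per_device_eval_batch_size=32,
    predict_with_generate=True,
    fp16=True,
    eval_accumulation_steps=16,  # offload eval predictions to the CPU every 16 steps
)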

Thank you so much. Your answer solved my problem.


This solves the exact problem for me too. Why has nobody from Hugging Face integrated this directly into the Trainer? It took me so long to find this solution after trying so many other things…


I do not understand what you said.

My Trainer is the following:

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model, padding="longest", label_pad_token_id=-100)
evaluator = QGenMetrics(tokenizer)
trainer = Seq2SeqTrainer(
    model=model,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    data_collator=data_collator,
    # compute_metrics=evaluator.compute_metrics_validation,
    args=Seq2SeqTrainingArguments(
        output_dir="./_remove",
        gradient_accumulation_steps=1,
        per_device_train_batch_size=32,
        per_device_eval_batch_size=32,
        num_train_epochs=1,
        seed=1,
        data_seed=2,
        predict_with_generate=True,
        eval_strategy="epoch",
        report_to="none"
    ) #< training args
) #< trainer

The model is a T5-small, which should not take up much space.

The trainer above occupies all the GPU RAM it finds; in my case it fills 80 GB. When training ends, the memory is not deallocated.

I changed this project from pure PyTorch to Hugging Face classes, and it is very hard to understand everything that is happening on the GPU right now.
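
Here is a minimal sketch of the cleanup I intend to try once trainer.train() returns, using standard PyTorch calls rather than anything Trainer-specific; empty_cache only releases PyTorch's cached blocks back to the driver, so the references have to be dropped first:

import gc
import torch

del trainer
del model
gc.collect()
torch.cuda.empty_cache()  # release cached CUDA memory so nvidia-smi reflects it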


This is what worked for me when I was adapting my ASR model architecture to the Hugging Face Trainer API:

def compute_metrics(pred):
    pred_logits = pred.predictions
    label_ids = pred.label_ids

    if isinstance(pred_logits, tuple):
        pred_ids = pred_logits[0]
    else:
        pred_ids = pred_logits
    if pred_ids.ndim == 3:
        pred_ids = np.argmax(pred_ids, axis=-1)

    label_ids[label_ids == -100] = tokenizer.pad_token_id
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)
    cer = 100 * metric.compute(predictions=pred_str, references=label_str)
    return {"cer": cer}
  1. The isinstance check lets the code handle both tuple and non-tuple formats of pred_logits.

  2. For 3-dimensional logits, taking the argmax over the class dimension converts them into a 2-dimensional array of predicted class IDs.
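
In case it helps with the GPU memory side of this thread: as far as I know, the Trainer also accepts a preprocess_logits_for_metrics callable, so the argmax can already happen on the GPU for each evaluation batch and only the predicted IDs get accumulated. A rough sketch under that assumption (training_args, train_ds and eval_ds stand in for whatever you already have):

from transformers import Trainer

def preprocess_logits_for_metrics(logits, labels):
    # runs on the GPU for every eval batch; keep only the predicted token IDs
    if isinstance(logits, tuple):
        logits = logits[0]
    return logits.argmax(dim=-1)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    compute_metrics=compute_metrics,
    preprocess_logits_for_metrics=preprocess_logits_for_metrics,
)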
