Compute_metrics slowdown

Evaluation during trainer.train() slows down when a user-defined compute_metrics function is provided. It cannot even complete a full evaluation pass.

Dropping compute_metrics and setting prediction_loss_only=True dramatically speeds it up. Would it be better to provide a Metric instance instead?

Can you share a bit more information on what kind of metrics you are computing and what training script you are using?

Thanks for following up! The metric is a custom function I wrote that is not a Metric class instance. I am using Seq2SeqTrainer.

Let me know if you want me to copy the script here.

Yes, a minimal example would be great, otherwise it’s impossible to know what could cause the slowdown (e.g. large evaluation set, slow custom metric function, bug in the transformers code).

Here is a minimal example.

For the evaluation function,

import torch
from torch.nn.functional import cross_entropy
from typing import List, Union

def perplexity_from_logits(
        logits: torch.FloatTensor,
        labels: torch.LongTensor,
        shift: bool = True,
        normalize: bool = True,
) -> Union[float, List[float]]:
    if shift:
        # Align next-token predictions with their targets.
        labels = labels[..., 1:]
        logits = logits[..., :-1, :]

    with torch.no_grad():
        if normalize:
            # Mean cross-entropy over all tokens -> one perplexity value.
            perplexity = torch.exp(
                cross_entropy(logits.permute(0, 2, 1), labels)
            )
        else:
            # Per-token cross-entropy -> one perplexity per position.
            perplexity = torch.exp(
                cross_entropy(logits.permute(0, 2, 1), labels, reduction='none')
            )

    return perplexity

from typing import Any, Dict
from transformers import EvalPrediction

def compute_metric(eval_preds: EvalPrediction) -> Dict[str, Any]:
    (logits, hidden_states), labels = eval_preds
    return {'perplexity': perplexity_from_logits(logits=logits, labels=labels)}
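For reference, the perplexity computed above is just the exponential of the mean token cross-entropy; a dependency-free sanity check of that identity (pure Python, the toy probabilities are assumptions for illustration):

```python
import math

def perplexity(p_true):
    """exp of the mean negative log-likelihood of the correct tokens."""
    nll = [-math.log(p) for p in p_true]
    return math.exp(sum(nll) / len(nll))

# If the model assigns probability 0.5 to every correct token, the
# per-token cross-entropy is ln 2, so perplexity is about 2.
print(perplexity([0.5, 0.5, 0.5, 0.5]))  # ≈ 2.0
```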

For the training parameters,

from transformers import T5ForConditionalGeneration, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = T5ForConditionalGeneration.from_pretrained('t5-small')
training_args = Seq2SeqTrainingArguments(
    ...  # arguments omitted
)

trainer = Seq2SeqTrainer(
    ...  # arguments omitted
)

In my use case, dataset_val has 249,459 samples. The dataset is too complicated and large to share here.

Since the evaluation slows down as it progresses, either the large evaluation set or a memory leak (unlikely) might be coming into play. Any insight into working with large evaluation sets would be helpful. :hugs:

Running generation on 250k samples during evaluation is extremely expensive. I would run it on a smaller subsample (e.g. 1k) and see if that improves the speed.
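A random subsample keeps the estimate representative even for a diverse eval set. A minimal sketch of picking the indices (the helper name is hypothetical; with a Hugging Face datasets.Dataset you would pass the indices to .select):

```python
import random

def subsample_indices(n_total, k, seed=0):
    """Draw k distinct indices from range(n_total), reproducibly."""
    return random.Random(seed).sample(range(n_total), k)

idx = subsample_indices(250_000, 1_000)
assert len(set(idx)) == 1_000
# With a Hugging Face dataset: small_val = dataset_val.select(idx)
```

Fixing the seed keeps the subsample identical across evaluation runs, so metric curves stay comparable during training.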

Unfortunately, given the diverse sample space, I still prefer to evaluate all 250k samples. I don’t mind the compute time for evaluation (estimated at 3h with single-sample inference), but the slowdown leads to a >24h runtime. Do you have any suggestions or insights (for example caching)?

The main question is: do the evaluation scores vary much if you sample e.g. 25k samples (I suspect this is still too many for monitoring training performance)? In most cases that should already give you a solid estimate, which is usually enough to monitor training, and you can still do a full evaluation at the end of training.

Caching isn’t really an option, since your model changes during training and thus the predictions vary each time. I’m not sure what you would cache in that case.
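One more thing worth checking for the progressive slowdown: with a compute_metrics function set, Trainer accumulates the prediction tensors across the whole evaluation loop before calling it, and that accumulation grows with the eval set. A hedged sketch of two knobs the transformers Trainer exposes for this (the values here are illustrative assumptions, not recommendations):

```python
# Shrink what Trainer accumulates per eval step; only the reduced
# tensor (not the full vocab-sized logits) is kept for compute_metrics.
def preprocess_logits_for_metrics(logits, labels):
    return logits.argmax(dim=-1)  # example reduction; adapt to your metric

training_args = Seq2SeqTrainingArguments(
    output_dir='out',
    eval_accumulation_steps=16,  # offload accumulated tensors to CPU every 16 steps
)
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    preprocess_logits_for_metrics=preprocess_logits_for_metrics,
)
```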