Evaluation became slower and slower during Trainer.train()

When I used Trainer.train() to fine-tune BartBase, I noticed something weird: the speed shown in the progress bar became slower and slower (from 6 item/s to 0.29 item/s). Please help me, I’m new to transformers.

Here is my code.

training_args = TrainingArguments(
    output_dir="Model/BartBase",
    overwrite_output_dir=True,
    
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    learning_rate=1e-5,
    num_train_epochs=20,
    lr_scheduler_type='linear',
    label_smoothing_factor=0,
    
#     logging_dir='runs',
    logging_strategy='steps', # log according to logging_steps
    logging_steps=1,
    
    save_strategy='steps', # save according to save_steps
    save_steps=4000,
    save_total_limit=10, # limit the total amount of checkpoints
    
    evaluation_strategy="steps", # evaluate according to eval_steps
    eval_steps=1, # I set eval_steps=1 to debug
    eval_accumulation_steps=1,
    
    seed=42, 
    
    load_best_model_at_end=True, # load best model according to metric_for_best_model
    metric_for_best_model='f1' # the string should be the name of a metric returned by compute_metrics
    )


from datasets import load_metric
import numpy as np

def compute_metrics(eval_pred):
    f1_metric = load_metric('f1')
    accuracy_metric = load_metric('accuracy')
    pred, label = eval_pred
    pred = np.argmax(pred, axis=-1)
    f1_score = f1_metric.compute(predictions=pred, references=label, average='micro')
    accuracy = accuracy_metric.compute(predictions=pred, references=label)
    f1_score.update(accuracy)  # dict.update returns None, so merge first and return the merged dict
    return f1_score


from transformers import Trainer
trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    data_collator=collator, # if tokenizer is provided, no need to provide it explicitly
    
    train_dataset=train_dataset, # torch.utils.data.dataset.Dataset
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics
)

trainer.train()

After debugging step by step, I found that

  1. If I removed compute_metrics=compute_metrics from the Trainer, the evaluation went well.
  2. Even with a trivially simple compute_metrics, the evaluation became slower and slower and eventually stalled (without finishing the progress bar):
    def compute_metrics(eval_pred):
        return {'f1': 1}
    

Please give me some help. Thanks a lot! :pray:


Hi,
I am also experiencing speed issues.
Have you solved it yet? :slight_smile:

I was initially experiencing slower and slower performance, but after setting predict_with_generate to True, every batch ran at the same speed!
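
For reference, here is a minimal sketch of that setting. It assumes the seq2seq variants of the trainer classes, since predict_with_generate lives on Seq2SeqTrainingArguments rather than the plain TrainingArguments used above, and it reuses the model/tokenizer/datasets from the original post:

    from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

    training_args = Seq2SeqTrainingArguments(
        output_dir="Model/BartBase",
        per_device_eval_batch_size=16,
        evaluation_strategy="steps",
        predict_with_generate=True,  # evaluate on generated token ids instead of raw logits
    )

    trainer = Seq2SeqTrainer(
        model=model,
        args=training_args,
        tokenizer=tokenizer,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        compute_metrics=compute_metrics,  # note: it now receives generated token ids, not logits
    )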

I experience the same issue.


I am experiencing the same issue! Any way to fix it?

I think I found a way to solve it! :grin:
According to https://discuss.huggingface.co/t/cuda-out-of-memory-when-using-trainer-with-compute-metrics/2941/2, I think the probable reason is that “When computing metrics inside the Trainer, your predictions are all gathered together on the device (GPU/TPU) and only passed back to the CPU at the end (because that operation can be slow).”

But when computing the metrics we do not need all the logits (just the argmax index), so I solved the problem by introducing a preprocess_logits_for_metrics function:

    def compute_metrics_acc(tokenizer):
        def compute_metric(eval_preds):
            preds, targets = eval_preds
            # replace -100 (ignored positions) with the pad token id before decoding
            preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
            targets = np.where(targets != -100, targets, tokenizer.pad_token_id)
            preds = tokenizer.batch_decode(preds, skip_special_tokens=True, clean_up_tokenization_spaces=True)
            targets = tokenizer.batch_decode(targets, skip_special_tokens=True, clean_up_tokenization_spaces=True)
            correct = 0
            assert len(preds) == len(targets)
            for idx, pred in enumerate(preds):
                reference = extract_ans(targets[idx])  # extract_ans is my own answer-extraction helper
                best_option = extract_ans(pred)
                if reference == best_option and reference != False:
                    correct += 1
            return {'accuracy': 1.0 * correct / len(targets)}
        return compute_metric

    def preprocess_logits_for_metrics(logits, labels):
        """
        The original Trainer may have a memory leak.
        This is a workaround to avoid storing too many tensors that are not needed.
        """
        pred_ids = torch.argmax(logits, dim=-1)
        return pred_ids

and pass it to the trainer.
I left my trainer setup here:

    trainer = SFTTrainer(
        model=base_model,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        peft_config=peft_config,
        packing=script_args.packing,
        max_seq_length=1024,
        tokenizer=tokenizer,
        args=training_args,
        data_collator=collator,
        compute_metrics=compute_metrics_acc(tokenizer),  # the factory defined above returns the metric function
        preprocess_logits_for_metrics=preprocess_logits_for_metrics,
        formatting_func=prepare_sample_text,
    )

I am training a semantic segmentation model (Mask2Former) and hitting exactly this problem.
preprocess_logits_for_metrics helps a bit, but doesn’t solve it. Overall, having to predict the whole dataset prior to calculating the metric seems a strange choice to me…
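
For what it's worth, here is a minimal sketch of the kind of reduction preprocess_logits_for_metrics can do for dense prediction. It assumes a model whose logits come back as a single (batch, num_classes, H, W) tensor; Mask2Former's query-based outputs (class and mask logits) would need a model-specific reduction instead:

    import torch

    def preprocess_logits_for_metrics(logits, labels):
        # some models return a tuple (logits, extra_outputs); keep only the logits
        if isinstance(logits, tuple):
            logits = logits[0]
        # keep only the per-pixel class index so the eval loop does not
        # accumulate full (batch, num_classes, H, W) logit tensors on the GPU
        return logits.argmax(dim=1)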


Hi there,

I have run into the same issue using the official run_qa.py script provided in transformers 4.40. For me, the problem happened exclusively with my fine-tuned RoBERTa-base; RoBERTa-large and BERT-base-uncased were fine. I am really confused by this behaviour.

My work-around is setting --eval_do_concat_batches to False, which prevents the evaluation loop from accumulating all logits in memory until every batch has been evaluated. Since this breaks output.predictions down into a list of lists of predictions, you will need to concatenate them before passing them to the metric function. My hack in the trainer_qa.py file is:

# with eval_do_concat_batches=False, output.predictions is a list of per-batch
# tuples; regroup the items by output position before concatenating
predictions = [[], [], []]
for t in output.predictions:
    for i, item in enumerate(t):
        predictions[i].append(item)
# concatenate the start and end logits across batches
predictions = [np.concatenate(item) for item in predictions[:2]]

This method prevents my evaluation from running slower and slower after each batch.
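
For anyone not going through the example scripts, here is a minimal sketch of the same setting passed directly through TrainingArguments (assuming a transformers version that exposes eval_do_concat_batches, such as the 4.40 used above):

    from transformers import TrainingArguments

    training_args = TrainingArguments(
        output_dir="out",
        eval_do_concat_batches=False,  # keep per-batch predictions as a list instead of one big array
    )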
