Evaluation became slower and slower during Trainer.train()

When I used Trainer.train() to fine-tune BartBase, I noticed something weird: the speed shown in the progress bar became slower and slower (from 6 item/s to 0.29 item/s). Please help me, I’m new to transformers.

Here is my code.

training_args = TrainingArguments(
    output_dir="Model/BartBase",
    overwrite_output_dir=True,
    
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    learning_rate=1e-5,
    num_train_epochs=20,
    lr_scheduler_type='linear',
    label_smoothing_factor=0,
    
#     logging_dir='runs',
    logging_strategy='steps', # log according to logging_steps
    logging_steps=1,
    
    save_strategy='steps', # save according to save_steps
    save_steps=4000,
    save_total_limit=10, # limit the total amount of checkpoints
    
    evaluation_strategy="steps", # evaluate according to eval_steps
    eval_steps=1, # I set eval_steps=1 to debug
    eval_accumulation_steps=1,
    
    seed=42, 
    
    load_best_model_at_end=True, # load best model according to metric_for_best_model
    metric_for_best_model='f1' # the string should be the name of a metric returned by compute_metrics
    )


from datasets import load_metric
import numpy as np

def compute_metrics(eval_pred):
    f1_metric = load_metric('f1')
    accuracy_metric = load_metric('accuracy')
    pred, label = eval_pred
    pred = np.argmax(pred, axis=-1)
    f1_score = f1_metric.compute(predictions=pred, references=label, average='micro')
    accuracy = accuracy_metric.compute(predictions=pred, references=label)
    f1_score.update(accuracy)  # dict.update returns None, so merge first and return the merged dict
    return f1_score


from transformers import Trainer
trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    data_collator=collator, # if tokenizer is provided, no need to provide it explicitly
    
    train_dataset=train_dataset, # torch.utils.data.dataset.Dataset
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics
)

trainer.train()

After debugging step by step, I found that

  1. If I removed compute_metrics=compute_metrics from the Trainer, the evaluation went well.
  2. Even with a trivially simple compute_metrics, the evaluation became slower and slower and eventually stalled (without finishing the progress bar):
    def compute_metrics(eval_pred):
        return {'f1': 1}
    

Please give me some help. Thanks a lot! :pray:


Hi,
I am also experiencing speed issues.
Have you solved it yet? :slight_smile:

I was initially experiencing slower and slower performance, but after setting predict_with_generate to True, every batch ran at the same speed!
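
For reference, here is a minimal sketch of that setting. It assumes the seq2seq variants of the trainer classes, since predict_with_generate lives on Seq2SeqTrainingArguments rather than the plain TrainingArguments used above, and it reuses the model/tokenizer/datasets from the original post:

    from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

    training_args = Seq2SeqTrainingArguments(
        output_dir="Model/BartBase",
        per_device_eval_batch_size=16,
        evaluation_strategy="steps",
        predict_with_generate=True,  # evaluate on generated token ids instead of raw logits
    )

    trainer = Seq2SeqTrainer(
        model=model,
        args=training_args,
        tokenizer=tokenizer,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        compute_metrics=compute_metrics,  # note: it now receives generated token ids, not logits
    )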

I experience the same issue.


I am experiencing the same issue! Any way to fix it?

I think I found a way to solve it! :grin:
According to https://discuss.huggingface.co/t/cuda-out-of-memory-when-using-trainer-with-compute-metrics/2941/2, I think the probable reason is that “When computing metrics inside the Trainer, your predictions are all gathered together on the device (GPU/TPU) and only passed back to the CPU at the end (because that operation can be slow).”

But when computing the metrics we do not need all the logits (just the argmax index), so I solved the problem by introducing a preprocess_logits_for_metrics function:

    def compute_metrics_acc(tokenizer):
        def compute_metric(eval_preds):
            preds, targets = eval_preds
            # replace -100 (ignored positions) with the pad token id before decoding
            preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
            targets = np.where(targets != -100, targets, tokenizer.pad_token_id)
            preds = tokenizer.batch_decode(preds, skip_special_tokens=True, clean_up_tokenization_spaces=True)
            targets = tokenizer.batch_decode(targets, skip_special_tokens=True, clean_up_tokenization_spaces=True)
            correct = 0
            assert len(preds) == len(targets)
            for idx, pred in enumerate(preds):
                reference = extract_ans(targets[idx])  # extract_ans is my own answer-extraction helper
                best_option = extract_ans(pred)
                if reference == best_option and reference != False:
                    correct += 1
            return {'accuracy': 1.0 * correct / len(targets)}
        return compute_metric

    def preprocess_logits_for_metrics(logits, labels):
        """
        The original Trainer may have a memory leak.
        This is a workaround to avoid storing too many tensors that are not needed.
        """
        pred_ids = torch.argmax(logits, dim=-1)
        return pred_ids

and pass it to the trainer.
I left my trainer setup here:

    trainer = SFTTrainer(
        model=base_model,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        peft_config=peft_config,
        packing=script_args.packing,
        max_seq_length=1024,
        tokenizer=tokenizer,
        args=training_args,
        data_collator=collator,
        compute_metrics=compute_metrics_acc(tokenizer),  # the factory defined above returns the metric function
        preprocess_logits_for_metrics=preprocess_logits_for_metrics,
        formatting_func=prepare_sample_text,
    )

I am training a semantic segmentation model (Mask2Former) and hitting exactly this problem.
preprocess_logits_for_metrics helps a bit, but doesn’t solve it. Overall, having to predict the whole dataset prior to calculating the metric seems a strange choice to me…
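
For what it's worth, here is a minimal sketch of the kind of reduction preprocess_logits_for_metrics can do for dense prediction. It assumes a model whose logits come back as a single (batch, num_classes, H, W) tensor; Mask2Former's query-based outputs (class and mask logits) would need a model-specific reduction instead:

    import torch

    def preprocess_logits_for_metrics(logits, labels):
        # some models return a tuple (logits, extra_outputs); keep only the logits
        if isinstance(logits, tuple):
            logits = logits[0]
        # keep only the per-pixel class index so the eval loop does not
        # accumulate full (batch, num_classes, H, W) logit tensors on the GPU
        return logits.argmax(dim=1)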


Hi there,

I have run into the same issue using the official run_qa.py script provided in transformers 4.40. For me, the problem happened exclusively with my fine-tuned RoBERTa-base; RoBERTa-large and BERT-base-uncased were fine. I am really confused by this behaviour.

My work-around is setting --eval_do_concat_batches to False, which prevents the evaluation loop from accumulating all logits in memory until every batch has been evaluated. Since this breaks output.predictions down into a list of lists of predictions, you will need to concatenate them before passing them to the metric function. My hack in the trainer_qa.py file is:

# with eval_do_concat_batches=False, output.predictions is a list of per-batch
# tuples; regroup the items by output position before concatenating
predictions = [[], [], []]
for t in output.predictions:
    for i, item in enumerate(t):
        predictions[i].append(item)
# concatenate the start and end logits across batches
predictions = [np.concatenate(item) for item in predictions[:2]]

This method prevents my evaluation from running slower and slower after each batch.
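
For anyone not going through the example scripts, here is a minimal sketch of the same setting passed directly through TrainingArguments (assuming a transformers version that exposes eval_do_concat_batches, such as the 4.40 used above):

    from transformers import TrainingArguments

    training_args = TrainingArguments(
        output_dir="out",
        eval_do_concat_batches=False,  # keep per-batch predictions as a list instead of one big array
    )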
