Trainer.evaluate() is freezing

Hello, I’m trying to train a RoBERTa model for sequence classification. Previously, I was able to train it with the “test_trainer” arguments, and training itself completed fine. However, when I subsequently ran trainer.evaluate(), the process would stall out.


[screenshot: trainer.evaluate() freezes here]

When I instead incorporated the evaluation step into the training arguments (evaluation_strategy), I got the same behavior: after the first epoch/steps finished training, the evaluation step would run and then the whole process would freeze.


[screenshot: evaluation via evaluation_strategy in the training arguments freezes here]

CODE:

import numpy as np
import evaluate
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./output",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    evaluation_strategy="steps",
    logging_dir="./output/logs",
    logging_strategy="steps",
    logging_steps=10,
    learning_rate=5e-5,
    weight_decay=0.01,
    warmup_steps=500,
    save_strategy="steps",
    load_best_model_at_end=True,
    save_total_limit=2,
)
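
A quick aside on the arguments above (an observation, not a confirmed cause): with evaluation_strategy="steps" and no explicit eval_steps, eval_steps falls back to logging_steps, so this config evaluates every 10 steps while save_steps keeps its default of 500; load_best_model_at_end=True additionally requires the save interval to be a round multiple of the eval interval. A minimal sketch with both intervals made explicit (500 is just an example cadence):

# Sketch only: explicit, matching intervals, so every checkpoint lines up
# with an evaluation (save_steps must be a round multiple of eval_steps
# when load_best_model_at_end=True).
training_args = TrainingArguments(
    output_dir="./output",
    evaluation_strategy="steps",
    eval_steps=500,
    save_strategy="steps",
    save_steps=500,
    load_best_model_at_end=True,
)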

metric = evaluate.load("accuracy")  # assuming the evaluate library; Accuracy() is otherwise undefined here

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=eval_data,
    compute_metrics=compute_metrics
)

No error logs, just the progress bar stalling out.

On transformers v4.33.1
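
One way to narrow down a silent stall like this, as a debugging sketch rather than a confirmed fix (the 32-example slice and the accumulation interval below are arbitrary): evaluate a tiny slice first, and if that completes, flush predictions to the CPU in chunks via eval_accumulation_steps, since by default all logits are accumulated on the GPU until the evaluation loop finishes:

# Debugging sketch; assumes eval_data is a datasets.Dataset (has .select()).

# 1) Evaluate a tiny slice; if this returns, the hang is likely size-related.
print(trainer.evaluate(eval_dataset=eval_data.select(range(32))))

# 2) If so, move accumulated predictions to the CPU periodically instead
#    of holding every logit tensor on the GPU until the loop ends.
training_args = TrainingArguments(
    output_dir="./output",
    per_device_eval_batch_size=8,
    evaluation_strategy="steps",
    eval_accumulation_steps=20,  # flush predictions to CPU every 20 eval steps
)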

bump on this

another bump

I’m also encountering something similar when following the Fine-Tune Whisper tutorial.

My code looks something like this:

from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-eng-gen",  # change to a repo name of your choice
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=1000,
    gradient_checkpointing=True,
    fp16=True,
    evaluation_strategy="steps",
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=1000,
    eval_steps=1000,
    logging_steps=25,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=True,
    ignore_data_skip=True
)

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice_train,
    eval_dataset=common_voice_test,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)

trainer.train()

Output from trainer.train():

Dataset is an Iterable Dataset.
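
If common_voice_test is a streaming IterableDataset, the evaluation dataloader cannot report a length and simply runs until the stream is exhausted, which on a large split (especially with predict_with_generate generating up to 225 tokens per sample) can look like a freeze. A sketch of capping the eval stream, assuming the datasets streaming API (the 200-example cap is arbitrary):

# Sketch: bound the streaming eval set so evaluation has a finite,
# small number of batches (200 examples is an arbitrary cap).
eval_subset = common_voice_test.take(200)

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice_train,
    eval_dataset=eval_subset,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)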

Would love to hear input and possible fixes for this issue.