Trainer.evaluate() freezing

Hello, I’m trying to train a RoBERTa model for sequence classification. Previously, I was able to train it with the “test_trainer” arguments without problems. However, when I subsequently run trainer.evaluate(), the evaluation appears to run to completion and then the process stalls.


(screenshot: trainer.evaluate() freezes at this point)

When I instead enabled evaluation during training (via evaluation_strategy in the training arguments), I got the same behavior: after the first epoch/steps finished training, the evaluation step ran and then the whole process froze.


(screenshot: evaluation_strategy in the training arguments freezes at this point)

CODE:

import numpy as np
import evaluate
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./output",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    evaluation_strategy="steps",
    logging_dir="./output/logs",
    logging_strategy="steps",
    logging_steps=10,
    learning_rate=5e-5,
    weight_decay=0.01,
    warmup_steps=500,
    save_strategy="steps",
    load_best_model_at_end=True,
    save_total_limit=2,
)

metric = evaluate.load("accuracy")  # accuracy metric from the evaluate library

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=eval_data,
    compute_metrics=compute_metrics
)
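
For reference, one way to exercise just the evaluation step in isolation is to run it on a small slice with the DataLoader workers disabled (a minimal sketch, assuming eval_data is a datasets.Dataset so .select() is available; the slice size and output directory are arbitrary):

# Run evaluation alone on a tiny slice, with no DataLoader worker processes
# and pinned memory off, to see whether the hang depends on the dataloader setup.
debug_args = TrainingArguments(
    output_dir="./output/debug_eval",
    per_device_eval_batch_size=8,
    dataloader_num_workers=0,
    dataloader_pin_memory=False,
)
debug_trainer = Trainer(
    model=model,
    args=debug_args,
    eval_dataset=eval_data.select(range(32)),  # small slice of the eval set
    compute_metrics=compute_metrics,
)
print(debug_trainer.evaluate())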

No error logs, just the progress bar stalling out.

On transformers v4.33.1
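
Since there are no error logs, one way to see where the process is stuck is Python’s built-in faulthandler, which can periodically dump every thread’s stack trace to stderr (a sketch; the 60-second interval is arbitrary):

import faulthandler

# Dump all thread stack traces every 60 seconds; when the run freezes,
# the last dump shows where each thread is blocked.
faulthandler.dump_traceback_later(60, repeat=True)

trainer.train()
trainer.evaluate()

faulthandler.cancel_dump_traceback_later()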


one more bump


I’m running into the same issue and have no idea what’s happening. Have you figured out a solution?


I’m running into a similar issue, but in my case training typically stops after several epochs and never makes it through the full 25. All of a sudden progress just halts: the GPU stops doing any computation but still has memory reserved, typically 4-6 GB on a card with 24 GB available. The CPU never shows significant memory pressure either and usually has around 32-40 GB of RAM free.
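
One thing that might help narrow down where progress stops is a heartbeat callback that logs the last completed step and each finished evaluation (a sketch using transformers’ TrainerCallback, assuming a Trainer instance named trainer as in the original post; the print interval is arbitrary):

from transformers import TrainerCallback

class HeartbeatCallback(TrainerCallback):
    # Print a marker as training progresses so the log shows exactly
    # which step was the last one to complete before the freeze.
    def on_step_end(self, args, state, control, **kwargs):
        if state.global_step % 50 == 0:
            print(f"heartbeat: finished step {state.global_step}", flush=True)

    def on_evaluate(self, args, state, control, **kwargs):
        print(f"heartbeat: evaluation finished at step {state.global_step}", flush=True)

trainer.add_callback(HeartbeatCallback())
trainer.train()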