Autogenerated model cards not showing the best metrics when using "load_best_model_at_end=True"

I am fine-tuning a BERT model for a token classification task. My training arguments are as follows:

args = TrainingArguments(
    output_dir="bert-finetuned-ner",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=5,
    metric_for_best_model="f1",
    greater_is_better=True,
    load_best_model_at_end=True,
    learning_rate=2e-5,
    num_train_epochs=50,
    weight_decay=0.01,
    logging_steps=10,
    logging_strategy="epoch",
    push_to_hub=True
)

I am using the Trainer class to train the model:

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)

I then train the model and push it to the Hub using the following code:

trainer.train()
trainer.push_to_hub()

Training ran for 10 epochs and then stopped via the EarlyStoppingCallback. The model achieved its highest F1 score at the 7th epoch, but the auto-generated model card on the Hugging Face Hub displays the metrics from the final (10th) epoch, which has a lower F1 score.

This leaves me uncertain whether the model weights pushed to the Hub are from the 7th or the 10th epoch: since I specified load_best_model_at_end=True in my TrainingArguments, I expected the model card to reflect the 7th-epoch metrics.

Is there a way to determine which epoch's checkpoint ended up in the model pushed to the Hub? I have looked through the HF Transformers and Hub documentation but have not found anything about this.
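For context, the closest thing I have found so far is that each saved checkpoint contains a trainer_state.json file, and the in-memory trainer.state has best_model_checkpoint and best_metric fields after training. Here is a minimal sketch of how one could read that file to see which checkpoint was considered best (the helper name is mine; the JSON fields are the ones the Trainer writes):

    import json
    from pathlib import Path

    def best_checkpoint_info(checkpoint_dir):
        """Read trainer_state.json from a saved checkpoint directory and
        return (best_model_checkpoint, best_metric) as recorded there."""
        state = json.loads(Path(checkpoint_dir, "trainer_state.json").read_text())
        return state["best_model_checkpoint"], state["best_metric"]

So after training one could call, e.g., best_checkpoint_info("bert-finetuned-ner/checkpoint-1750") (a hypothetical checkpoint path) or simply print trainer.state.best_model_checkpoint, but it is still unclear to me whether the pushed model card is supposed to use those best metrics.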
