How to update the model card with the best model load after a train

oMateos2020 · August 2, 2022, 1:27pm

Hi everyone, I am new to Hugging Face and need it for my master’s final thesis. I have been training Seq2Seq models with AutoModelForSeq2SeqLM and Seq2SeqTrainer, and there is something that is not clear to me.

I have set up the arguments to store all the checkpoints and load the best of them at the end of training, so that I can later push the best model to the Hub.

Here is the args code:

args = Seq2SeqTrainingArguments(
    output_dir=model_name,  
    dataloader_num_workers=2,
    eval_steps=299, #280
    evaluation_strategy="steps", #epoch
    fp16=True,
    generation_num_beams=2, 
    hub_model_id=model_id,
    hub_strategy="checkpoint",
    hub_token="hf_XXXXXXXXXXXXXXXXXXXXXX,
    learning_rate=1e-5*gradient_accumulation_steps, 
    load_best_model_at_end=True,
    logging_steps=299,
    logging_strategy="steps", 
    metric_for_best_model="rouge1",
    num_train_epochs=1,
    optim="adafactor", 
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    predict_with_generate=True,
    push_to_hub=True,
    save_strategy="steps", 
    save_steps=299,
    #save_total_limit=3,
    report_to="wandb",
    warmup_steps=598, 
    weight_decay=0.01,
    gradient_accumulation_steps = gradient_accumulation_steps,
    #gradient_checkpointing=True,
    label_smoothing_factor = 0.1,
    group_by_length=False
)

Once training finishes with early stopping, I get the message that the best model was loaded:

Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from pegasus-newsroom-cnn_full-adafactor-bs6/checkpoint-598 (score: 39.4265).

That makes sense, so I push it to the Hub with trainer.push_to_hub(). But when I go to the model page on the Hub, the model card details are not the ones from the best model’s evaluation, which was at step 598 as shown below. Instead, the card is filled with the last step’s details, which makes it unclear which step was actually pushed.

Training results

Training Loss  Epoch  Step  Validation Loss  Rouge1   Rouge2   Rougel   Rougelsum  Gen Len
3.2894         0.1    299   2.9464           39.4079  18.3064  28.093   36.5182    64.6904
3.0427         0.2    598   2.9307           39.4265  18.2924  28.247   36.6382    60.5696
3.1017         0.3    897   2.9891           39.0977  17.9198  27.9078  36.2363    58.5172
3.2891         0.4    1196  3.5756           29.5555  11.7552  22.4675  27.2432    45.0232
637.0317       0.5    1495  nan              0.0      0.0      0.0      0.0        1.0

And the results on the model card:

This model was trained from scratch on the cnn_dailymail dataset.
It achieves the following results on the evaluation set:

Loss: nan
Rouge1: 0.0
Rouge2: 0.0
Rougel: 0.0
Rougelsum: 0.0
Gen Len: 1.0

Even more confusing, when I go to the model’s files in the repo on the Hub, the files are described as coming from step 299.

My question is: is there any way to guarantee that the best model is loaded and pushed to the Hub? Is it possible to update the model card with the evaluation results from that checkpoint’s step, or should I do it manually?

I am very confused about this, and about whether the uploaded model is really the best model, which was at step 598, and not the last one or the one from step 299.
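One way to double-check which evaluation step was the best is to scan the logged evaluation metrics directly. The snippet below is only a minimal sketch: the helper name best_eval_entry is hypothetical, the rows are copied by hand from the training results table in this post, and the "eval_" prefix on the metric names is an assumption about how the Trainer logs them. In a real run the same list should be available as trainer.state.log_history, and trainer.state.best_model_checkpoint should point at the checkpoint that was reloaded.

```python
import math


def best_eval_entry(log_history, metric="eval_rouge1", greater_is_better=True):
    """Return the log entry with the best finite value of `metric`."""
    # Skip entries that lack the metric (training-only logs) or hold NaN.
    candidates = [
        entry for entry in log_history
        if metric in entry and not math.isnan(entry[metric])
    ]
    if not candidates:
        raise ValueError(f"no finite values logged for {metric!r}")
    pick = max if greater_is_better else min
    return pick(candidates, key=lambda entry: entry[metric])


# Evaluation rows copied from the training results table in this post.
log_history = [
    {"step": 299, "eval_loss": 2.9464, "eval_rouge1": 39.4079},
    {"step": 598, "eval_loss": 2.9307, "eval_rouge1": 39.4265},
    {"step": 897, "eval_loss": 2.9891, "eval_rouge1": 39.0977},
    {"step": 1196, "eval_loss": 3.5756, "eval_rouge1": 29.5555},
    {"step": 1495, "eval_loss": float("nan"), "eval_rouge1": 0.0},
]

best = best_eval_entry(log_history)
print(best["step"], best["eval_rouge1"])  # 598 39.4265
```

On these numbers the best Rouge1 is at step 598, which agrees with the "Loading best model from .../checkpoint-598" message, so the mismatch seems to be in the model card rather than in the loaded weights.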

Many thanks in advance,
Oscar

xshubhamx · April 13, 2024, 11:34pm

Hey, the best model is automatically pushed to the Hub according to this answer: Clarification on push_to_hub, best model, and model card

But the issue is that the model card shows the wrong metrics. Were you able to solve this issue? Can we write some code to edit the model card with the best metrics instead of the metrics of the last epoch?
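As a rough idea of what such code could look like, here is a minimal sketch that rewrites the metrics list in the model card text. It assumes the generated README.md lists the metrics as "- Name: value" bullets right under the "It achieves the following results on the evaluation set:" line; the helper name replace_eval_results is hypothetical, and the values are the step 598 row from the original post. Uploading the edited card is left out. The ModelCard class in huggingface_hub might be a way to load and push it, but I have not tested that here.

```python
import re

HEADER = "It achieves the following results on the evaluation set:"


def replace_eval_results(card_text, metrics):
    """Swap the bullet list under HEADER for one built from `metrics`."""
    bullets = "\n".join(f"- {name}: {value}" for name, value in metrics.items())
    # Match the header line plus any "- Name: value" lines right after it.
    pattern = re.compile(re.escape(HEADER) + r"\n(?:- .*\n?)*")
    new_text, count = pattern.subn(
        lambda _: f"{HEADER}\n{bullets}\n", card_text, count=1
    )
    if count == 0:
        raise ValueError("evaluation results section not found in the model card")
    return new_text


# Stand-in for the generated README.md (assumed bullet format).
card_text = (
    "This model was trained from scratch on the cnn_dailymail dataset.\n"
    f"{HEADER}\n"
    "- Loss: nan\n"
    "- Rouge1: 0.0\n"
    "- Rouge2: 0.0\n"
    "- Rougel: 0.0\n"
    "- Rougelsum: 0.0\n"
    "- Gen Len: 1.0\n"
    "\n"
    "## Model description\n"
)

# Metrics of the best evaluation step (598) from the original post.
best_metrics = {
    "Loss": 2.9307,
    "Rouge1": 39.4265,
    "Rouge2": 18.2924,
    "Rougel": 28.247,
    "Rougelsum": 36.6382,
    "Gen Len": 60.5696,
}

updated = replace_eval_results(card_text, best_metrics)
print(updated)
```

The function raises an error instead of silently doing nothing when the section is missing, so a change in the card template would be noticed rather than leaving the old numbers in place.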
