I have trained a model and saved it, along with its tokenizer. During training I set load_best_model_at_end to True, and the test results I see right after training are good.
Now I have a second file where I load the saved model and evaluate it on the test set, so that I don't have to retrain every time. However, the test results in this second file are worse than the ones I got right after training.
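Roughly, the second file does something like this (the paths, model class, test_dataset, and compute_metrics are placeholders, not my actual code):
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments

# Placeholders: "./saved_model" and test_dataset stand in for my actual paths/data
model = AutoModelForSequenceClassification.from_pretrained("./saved_model")
tokenizer = AutoTokenizer.from_pretrained("./saved_model")

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./eval_out", per_device_eval_batch_size=32),
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,  # same metric function used during training
)
metrics = trainer.evaluate(eval_dataset=test_dataset)  # test_dataset preprocessed the same way as in training
print(metrics)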
Is there a way to load the model from the best validation checkpoint?
Yes, but I do not know a priori which checkpoint is the best. I trained the model in another file and saved some of the checkpoints. I could track down the best checkpoint from that first file, but that is not an ideal solution.
I believe an ideal solution would be to save only the best checkpoint, or to overwrite the existing checkpoint whenever the model improves, so that in the end I have just one model.
Is what I want clearer now?
Is there a way to save only the best checkpoint instead of many?
I don’t understand the question. With load_best_model_at_end the model loaded at the end of training is the one that had the best performance on your validation set. So when you save that model, you have the best model on this validation set.
If it’s crap on another set, it means your validation set was not representative of the performance you wanted, and there is nothing the Trainer can do to fix that.
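For reference, here is a minimal sketch of that setup (model, train_ds, and val_ds are placeholders and the metric is illustrative; in recent transformers versions evaluation_strategy is named eval_strategy):
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="out",
    evaluation_strategy="epoch",      # evaluate every epoch...
    save_strategy="epoch",            # ...and save a checkpoint every epoch (must match the evaluation strategy)
    load_best_model_at_end=True,      # reload the best checkpoint when training finishes
    metric_for_best_model="eval_loss",
    greater_is_better=False,          # lower eval_loss is better
)
trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=val_ds)
trainer.train()
trainer.save_model("best_model")      # saves the weights currently held by the trainer, i.e. the best checkpoint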
I understand. So if I am saving multiple checkpoints, the model saved at the end corresponds to the checkpoint with the best validation score rather than the final weights.
import os
from transformers.trainer_callback import TrainerState

save_dir = "your_trainer_save_directory"

# Keep only the checkpoint-XXXX sub-directories and sort them by step number
ckpt_dirs = [d for d in os.listdir(save_dir) if d.startswith("checkpoint-")]
ckpt_dirs = sorted(ckpt_dirs, key=lambda x: int(x.split('-')[1]))
last_ckpt = ckpt_dirs[-1]

# The trainer_state.json of the last checkpoint records which checkpoint was best
state = TrainerState.load_from_json(f"{save_dir}/{last_ckpt}/trainer_state.json")
print(state.best_model_checkpoint)  # path to your best checkpoint
If the above code breaks or doesn't work because of an API or versioning change, you could trace the GitHub source starting from the equivalent of training_args.load_best_model_at_end to see how the best model checkpoint directory is determined.
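Once state.best_model_checkpoint is known, the weights can be reloaded from that directory; a sketch, assuming a sequence classification model (substitute whatever model class you actually trained):
from transformers import AutoModelForSequenceClassification, AutoTokenizer

best_dir = state.best_model_checkpoint
model = AutoModelForSequenceClassification.from_pretrained(best_dir)
# The tokenizer can only be loaded from the checkpoint if it was saved there
# (e.g. because it was passed to the Trainer); otherwise load it from your original save directory.
tokenizer = AutoTokenizer.from_pretrained(best_dir)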
The save_total_limit parameter of the TrainingArguments object can be set to 1 so that, combined with load_best_model_at_end=True, only the best checkpoint is kept on disk.
Note that the documentation says that when the best checkpoint and the last one differ, both may be kept at the end. However, I have not seen this scenario so far.
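A sketch of that configuration (argument values are illustrative):
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=1,              # delete older checkpoints, keeping at most one rotating checkpoint
    load_best_model_at_end=True,     # the best checkpoint is never deleted, so two directories may coexist
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)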