Loading a model locally from the best checkpoint

Hi all,

I have trained a model and saved it, along with the tokenizer. During training I set load_best_model_at_end to True and can see the test results, which are good.

Now I have another file where I load the model and evaluate it on the test data set, so that I do not have to train over and over again. But the test results in this second file, where I load the model, are worse than the ones right after training.

Is there a way to load the model from the checkpoint with the best validation score?

This is how I save:

tokenizer.save_pretrained(model_directory)
trainer.save_model()

and this is how I load:

tokenizer = T5Tokenizer.from_pretrained(model_directory)
model = T5ForConditionalGeneration.from_pretrained(model_directory, return_dict=False)

To load a particular checkpoint, just pass the path to that checkpoint directory to from_pretrained, which will load the model from that checkpoint.
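
For example, something like this (a minimal sketch; "model_directory/checkpoint-500" is a made-up path, substitute one of the checkpoint-* directories the Trainer wrote into your output directory):

from transformers import T5Tokenizer, T5ForConditionalGeneration

# placeholder path: point this at an actual checkpoint directory from your run
checkpoint_dir = "model_directory/checkpoint-500"
tokenizer = T5Tokenizer.from_pretrained("model_directory")  # tokenizer was saved at the top level
model = T5ForConditionalGeneration.from_pretrained(checkpoint_dir, return_dict=False)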

Yes, but I do not know a priori which checkpoint is the best. I trained the model in another file and saved some of the checkpoints. I can track down the best checkpoint from that first file, but it is not an optimal solution.

I believe an ideal solution would be to save only the best checkpoint, or to overwrite the existing checkpoint whenever the model improves, so that in the end I have just one model.

Is what I want clearer now?

Is there a way to save only the best checkpoint instead of many?

I am not sure if that’s possible with Trainer right now, pinging @sgugger

I don’t understand the question. With load_best_model_at_end the model loaded at the end of training is the one that had the best performance on your validation set. So when you save that model, you have the best model on this validation set.

If it’s crap on another set, it means your validation set was not representative of the performance you wanted and there is nothing we can do on Trainer to fix that.

I understand. This means that, when I save multiple checkpoints, the model saved at the end is the one with the best validation score rather than the final weights.

Am I right?

Yes, that’s what load_best_model_at_end=True does.
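
For reference, a minimal sketch of the relevant TrainingArguments (the directory, metric, and strategies below are placeholders, not taken from this thread; older transformers versions use evaluation_strategy, newer ones eval_strategy):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="model_directory",        # placeholder output directory
    evaluation_strategy="steps",         # evaluate periodically during training
    save_strategy="steps",               # checkpointing must line up with evaluation
    load_best_model_at_end=True,         # reload the best checkpoint when training ends
    metric_for_best_model="eval_loss",   # metric used to decide which checkpoint is "best"
    greater_is_better=False,             # lower eval_loss is better
)

With these arguments, the model held by the Trainer after train() finishes is the best checkpoint, so trainer.save_model() writes out those weights rather than the final ones.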

Thank you very much!

What if you kill training midway through? Is there an easy way to ask the trainer to load the best model determined up to that point?

I’m currently doing

import os
from transformers.trainer_callback import TrainerState

save_dir = "your_trainer_save_directory"
# keep only the checkpoint-* directories and sort them by step number
ckpt_dirs = [d for d in os.listdir(save_dir) if d.startswith("checkpoint-")]
ckpt_dirs = sorted(ckpt_dirs, key=lambda x: int(x.split('-')[1]))
last_ckpt = ckpt_dirs[-1]

state = TrainerState.load_from_json(f"{save_dir}/{last_ckpt}/trainer_state.json")

print(state.best_model_checkpoint)  # your best checkpoint

If the above code breaks or doesn't work because of an API change or versioning, you could try tracing the GitHub code starting from the equivalent of training_args.load_best_model_at_end to see how the best model checkpoint directory is determined.
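
Once state.best_model_checkpoint is known, you can load from it directly (a sketch continuing from the snippet above and assuming the same T5 classes as earlier in the thread):

from transformers import T5ForConditionalGeneration, T5Tokenizer

best_ckpt = state.best_model_checkpoint  # e.g. "your_trainer_save_directory/checkpoint-1200"
model = T5ForConditionalGeneration.from_pretrained(best_ckpt)
tokenizer = T5Tokenizer.from_pretrained(best_ckpt)  # only works if the tokenizer was saved into the checkpoint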

The save_total_limit parameter of the TrainingArguments object can be set to 1 so that only one checkpoint is kept; combined with load_best_model_at_end=True, the checkpoint that is retained is the best one.

Note that the documentation says that when the best checkpoint and the last one are different from each other, both may be kept at the end. However, I have not seen this scenario so far.
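
As a sketch, the combination described above would look roughly like this (placeholder directory and strategies; with load_best_model_at_end=True the Trainer retains the best checkpoint in addition to the most recent one, which is the scenario the documentation warns about):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="model_directory",
    evaluation_strategy="steps",
    save_strategy="steps",
    save_total_limit=1,            # keep at most one recent checkpoint (plus the best one, see note above)
    load_best_model_at_end=True,
)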