Saving only the best-performing checkpoint

Hi,

Is there a parameter in the config that allows us to save only the best-performing checkpoint?
Currently, multiple checkpoints are saved based on save_steps (which interacts with batch_size and dataset size). If we train the model for, let's say, 10 epochs and the 7th epoch gives the best performance on the validation set, how can we save only the checkpoint from the 7th epoch and ignore the rest?

Thanks.


There is no parameter for that yet; we're keeping it in mind for future development of the Trainer.


Is this implemented yet? If not, is there any other way to do this manually (read: hackily) with the Trainer API? (Except, of course, saving the model every step.)

Hi Tanuj,
I have not seen any parameter for that. However, there is a workaround.

Use the following combination:
evaluation_strategy = "steps",
eval_steps = 10,  # evaluation and save happen every 10 steps
save_total_limit = 5,  # only the last 5 checkpoints are kept; older ones are deleted
load_best_model_at_end = True,

When I tried this combination, at any given time the 5 most recent checkpoints were kept in the output directory, and if the best model was not among them, it was kept as well, so at most 1 + 5 checkpoints. You can set save_total_limit = 1 if that serves your purpose better.
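
For context, here is a minimal sketch of how those arguments might look in a full TrainingArguments / Trainer setup. The model, datasets, and compute_metrics names are placeholders, not from this thread; note that with load_best_model_at_end=True the save schedule has to line up with the evaluation schedule, hence the explicit save_steps:

    from transformers import Trainer, TrainingArguments

    # Sketch only: model, train_dataset, eval_dataset and compute_metrics are
    # assumed to be defined elsewhere.
    training_args = TrainingArguments(
        output_dir="./results",
        evaluation_strategy="steps",   # evaluate every eval_steps
        eval_steps=10,                 # evaluation (and save) every 10 steps
        save_steps=10,                 # must match eval_steps for load_best_model_at_end
        save_total_limit=5,            # keep only the 5 most recent checkpoints
        load_best_model_at_end=True,   # reload the best checkpoint after training
        metric_for_best_model="eval_loss",
        greater_is_better=False,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        compute_metrics=compute_metrics,
    )
    trainer.train()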


Another option is to use a Keras callback:
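
For example (a rough sketch; the file path, monitored metric, and model/dataset names below are placeholders), Keras's built-in ModelCheckpoint callback with save_best_only=True keeps only the best-scoring model:

    import tensorflow as tf

    # Sketch only: assumes a compiled tf.keras model plus train/validation datasets.
    checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
        filepath="./best_model",
        monitor="val_loss",   # metric tracked on the validation set
        save_best_only=True,  # overwrite the saved model only when val_loss improves
    )

    model.fit(
        train_dataset,
        validation_data=val_dataset,
        epochs=10,
        callbacks=[checkpoint_cb],
    )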

@sgugger So if I understand this correctly, in the post Checkpoints and disk storage - #9 by sgugger you effectively confirm that the configuration proposed by @karthikcs achieves the functionality asked for in the original question, namely that

save_total_limit=1

will save and load ONLY the best-performing checkpoint (save, because the best checkpoint is always protected and never gets deleted; load, because there is only one checkpoint left). Is this correct?


This is a great question. Was it ever clarified?


I can confirm that setting save_total_limit to whatever you want, even 1, will not interfere with the Trainer's ability to load the best model at the end. Look at the source:

First, _sorted_checkpoints orders the checkpoints oldest-first and makes sure the best model is never deleted by rotating it away from the front of the list (deletions are taken from the front):

        checkpoints_sorted = [checkpoint[1] for checkpoint in checkpoints_sorted]
        # Make sure we don't delete the best model.
        if self.state.best_model_checkpoint is not None:
            best_model_index = checkpoints_sorted.index(str(Path(self.state.best_model_checkpoint)))
            for i in range(best_model_index, len(checkpoints_sorted) - 2):
                checkpoints_sorted[i], checkpoints_sorted[i + 1] = checkpoints_sorted[i + 1], checkpoints_sorted[i]
        return checkpoints_sorted
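
To see what that rotation does, here is a small standalone sketch (plain Python with made-up checkpoint paths, not Trainer code):

    # Made-up list of checkpoints, oldest first, as _sorted_checkpoints would produce.
    checkpoints_sorted = [
        "out/checkpoint-100",
        "out/checkpoint-200",  # suppose this one is the best model
        "out/checkpoint-300",
        "out/checkpoint-400",
    ]
    best_model_checkpoint = "out/checkpoint-200"

    best_model_index = checkpoints_sorted.index(best_model_checkpoint)
    for i in range(best_model_index, len(checkpoints_sorted) - 2):
        checkpoints_sorted[i], checkpoints_sorted[i + 1] = checkpoints_sorted[i + 1], checkpoints_sorted[i]

    print(checkpoints_sorted)
    # ['out/checkpoint-100', 'out/checkpoint-300', 'out/checkpoint-200', 'out/checkpoint-400']

Since checkpoints to delete are taken from the front of this list, the best checkpoint ends up second from the end and survives the cleanup.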

Finally, after sorting, if save_total_limit=1 and the best model is not the most recent checkpoint, the limit is quietly raised to 2, so that the latest checkpoint is kept alongside the best model (the comment in the source explains this is to allow resuming):

        # If save_total_limit=1 with load_best_model_at_end=True, we could end up deleting the last checkpoint, which
        # we don't do to allow resuming.
        save_total_limit = self.args.save_total_limit
        if (
            self.state.best_model_checkpoint is not None
            and self.args.save_total_limit == 1
            and checkpoints_sorted[-1] != self.state.best_model_checkpoint
        ):
            save_total_limit = 2

What I am looking to do is save checkpoints frequently so as not to waste compute time if I need to restart (so I set save_total_limit = 5 and save often). But I also want to keep checkpoints at bigger intervals, like epochs, for future analysis of how well the training went. I definitely need to delete the small-step checkpoints, because otherwise I quickly use up the space. Is there an option to do this? Thanks!

It may depend on the size of your save_steps. In my case, I use:

save_steps = len(train_data["train"]) // batch_size  # one save after each epoch
warmup_steps = save_steps // 10  # 10% of save_steps
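
As a sketch, those two lines would slot into TrainingArguments like this (train_data and batch_size follow the post above; the concrete values and the remaining arguments are placeholders):

    from transformers import TrainingArguments

    # Placeholder standing in for the poster's dataset; in practice train_data
    # would be a datasets.DatasetDict and train_data["train"] its training split.
    train_data = {"train": range(10_000)}
    batch_size = 16

    save_steps = len(train_data["train"]) // batch_size  # roughly one save per epoch
    warmup_steps = save_steps // 10                       # 10% of save_steps

    training_args = TrainingArguments(
        output_dir="./results",
        per_device_train_batch_size=batch_size,
        save_steps=save_steps,
        warmup_steps=warmup_steps,
        save_total_limit=5,
    )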