Saving only the best performing checkpoint

Hi,

Is there a parameter in the config that allows us to save only the best performing checkpoint?
Currently, multiple checkpoints are saved based on save_steps (and the batch size and dataset size). If we want to train the model for, let's say, 10 epochs and the 7th epoch gives the best performance on the validation set, how can we save only the checkpoint from the 7th epoch and ignore the rest?

Thanks.

4 Likes

There is no parameter for that yet; we are keeping this in mind for future development of the Trainer.

4 Likes

Is this implemented yet? If not, is there any other way to do this manually (read: hackishly) with the Trainer API? (Except, of course, saving the model every step.)

Hi Tanuj,
I have not seen any parameter for that. However, there is a workaround.

Use the following combination:

evaluation_strategy = "steps",
eval_steps = 10,  # evaluation and save happen every 10 steps
save_total_limit = 5,  # only the last 5 models are saved; older ones are deleted
load_best_model_at_end = True,

When I tried this combination, at any time the 5 most recent checkpoints are kept in the output directory, but if the best model is not among them it is kept as well, so you end up with 1 + 5 models. You can change save_total_limit = 1 so it serves your purpose.
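
For reference, a minimal sketch of how that combination might look in TrainingArguments (output_dir is a placeholder, and the explicit save_steps / metric_for_best_model / greater_is_better lines are my additions, not part of the suggestion above):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",                   # placeholder
    evaluation_strategy="steps",
    eval_steps=10,                      # evaluate every 10 steps
    save_steps=10,                      # keep saving in sync with evaluation
    save_total_limit=1,                 # keep only the most recent checkpoint (plus the best one)
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",  # assumption: rank checkpoints by eval loss
    greater_is_better=False,
)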

5 Likes

Another option is to use a Keras callback.

@sgugger So if I understand this correctly, in this post Checkpoints and disk storage - #9 by sgugger you actually “confirm” that the configuration proposed by @karthikcs will achieve the functionality asked for in the original question, namely that having

save_total_limit=1

will actually save and load ONLY the best performing checkpoint (save, because the best one is always rotated to safety and “never gets deleted”; load, because there is only one checkpoint left). Is this correct?

8 Likes

This is a great question. Was it ever clarified?

1 Like

Can confirm that setting save_total_limit to whatever you want, even 1, will not interfere with the Trainer's ability to load the best model at the end. Look at the source:

First, _sorted_checkpoints decides which checkpoints to keep and makes sure not to delete the best model, by rotating the best model toward the end of the sorted list (checkpoints are deleted from the front of this list):

        checkpoints_sorted = [checkpoint[1] for checkpoint in checkpoints_sorted]
        # Make sure we don't delete the best model.
        if self.state.best_model_checkpoint is not None:
            best_model_index = checkpoints_sorted.index(str(Path(self.state.best_model_checkpoint)))
            for i in range(best_model_index, len(checkpoints_sorted) - 2):
                checkpoints_sorted[i], checkpoints_sorted[i + 1] = checkpoints_sorted[i + 1], checkpoints_sorted[i]
        return checkpoints_sorted
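
To see what that rotation does, here is a small standalone snippet (not library code) that runs the same loop on a made-up list of checkpoint names:

checkpoints_sorted = ["checkpoint-100", "checkpoint-200", "checkpoint-300", "checkpoint-400"]
best_model_checkpoint = "checkpoint-200"

best_model_index = checkpoints_sorted.index(best_model_checkpoint)
for i in range(best_model_index, len(checkpoints_sorted) - 2):
    checkpoints_sorted[i], checkpoints_sorted[i + 1] = checkpoints_sorted[i + 1], checkpoints_sorted[i]

print(checkpoints_sorted)
# ['checkpoint-100', 'checkpoint-300', 'checkpoint-200', 'checkpoint-400']
# Checkpoints to delete are taken from the front of this list, so the best one survives.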

Then, in _rotate_checkpoints, if save_total_limit=1 and the best model is not the most recent checkpoint, the limit is bumped up to 2, so that the latest checkpoint (needed for resuming) is kept alongside the best model:

        # If save_total_limit=1 with load_best_model_at_end=True, we could end up deleting the last checkpoint, which
        # we don't do to allow resuming.
        save_total_limit = self.args.save_total_limit
        if (
            self.state.best_model_checkpoint is not None
            and self.args.save_total_limit == 1
            and checkpoints_sorted[-1] != self.state.best_model_checkpoint
        ):
            save_total_limit = 2
1 Like

What I am looking to do is save checkpoints frequently so as not to waste compute time if I need to restart (so I set save_total_limit = 5 and save often). But I also want to keep checkpoints at bigger intervals, like every epoch, for future analysis of training success and so on. I definitely need to delete the smaller steps, because otherwise I quickly use up the disk space. Is there an option to do this? Thanks!

It may depend on the size of your save_steps. In my case, I use:

save_steps = len(train_data['train']) // batch_size  # after each epoch
warmup_steps = save_steps // 10  # 10% of save_steps
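
For context, a sketch of how that might slot into TrainingArguments (train_data, batch_size and output_dir are placeholders, and this assumes a single device so that save_steps really lands on an epoch boundary):

from transformers import TrainingArguments

batch_size = 8
save_steps = len(train_data['train']) // batch_size  # roughly one checkpoint per epoch

training_args = TrainingArguments(
    output_dir="out",               # placeholder
    per_device_train_batch_size=batch_size,
    save_steps=save_steps,
    warmup_steps=save_steps // 10,  # 10% of an epoch, as above
    save_total_limit=5,             # only the 5 most recent checkpoints are kept
)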

Do these parameter settings save the best checkpoint after training finishes?
If we set:

save_total_limit = 1, 
save_best_checkpoint = True 

or these settings:

        predict_with_generate=True,
        evaluation_strategy="steps",
        per_device_train_batch_size= 4,
        per_device_eval_batch_size= 4, 
        num_train_epochs= 10 ,
        fp16= True,
        learning_rate= 4e-5, 
        output_dir=f'models/', 
        logging_steps=10,
        save_steps=500,  
        eval_steps=500,  
        save_total_limit = 1,
        report_to=["tensorboard"],

If neither of the two above works, what is the correct parameter setting I should use to keep the best checkpoint at the end?

Do those settings save the best checkpoint at the end if I set save_total_limit=1 with save_best_checkpoint?

Hi @Thang, I hope you are well. Sorry, during training I can see the saved checkpoints, but when the training is finished no checkpoint is saved for testing; all checkpoints disappear from the folder. Would you please tell me how I can save the best model? My code is as follows. What am I missing?


training_args = TrainingArguments(
    output_dir=Results_Path, learning_rate=5e-5, num_train_epochs=10,
    evaluation_strategy="epoch", logging_strategy="epoch", save_strategy="epoch",
    seed=42, load_best_model_at_end=True, report_to="tensorboard",
    per_device_train_batch_size=2, per_device_eval_batch_size=2,
    save_total_limit=1, warmup_steps=100, weight_decay=0.01, logging_dir=Results_Path)

Trainer(model=model, args=training_args, tokenizer=tokenizer,
        train_dataset=train_dataset, eval_dataset=val_dataset,
        data_collator=lambda data: {'input_ids': torch.stack([f[0] for f in data]),
                                    'attention_mask': torch.stack([f[1] for f in data]),
                                    'labels': torch.stack([f[0] for f in data])}).train()

Hi @AlhitawiMohammed22, I hope you are well. Sorry, during training I can see the saved checkpoints, but when the training is finished no checkpoint is saved for testing; all checkpoints disappear from the folder. Would you please tell me how I can save the best model? My code is as follows. What am I missing?


training_args = TrainingArguments(
    output_dir=Results_Path, learning_rate=5e-5, num_train_epochs=10,
    evaluation_strategy="epoch", logging_strategy="epoch", save_strategy="epoch",
    seed=42, load_best_model_at_end=True, report_to="tensorboard",
    per_device_train_batch_size=2, per_device_eval_batch_size=2,
    save_total_limit=1, warmup_steps=100, weight_decay=0.01, logging_dir=Results_Path)

Trainer(model=model, args=training_args, tokenizer=tokenizer,
        train_dataset=train_dataset, eval_dataset=val_dataset,
        data_collator=lambda data: {'input_ids': torch.stack([f[0] for f in data]),
                                    'attention_mask': torch.stack([f[1] for f in data]),
                                    'labels': torch.stack([f[0] for f in data])}).train()
1 Like

It works when I pass a combination of three parameters:

  evaluation_strategy="steps",
  save_total_limit = 1,
  load_best_model_at_end =True,

In addition (optionally), you can save the model at the end:

trainer.save_model(output_dir="path/to/out")

So in your settings you just need to change the strategy from "epoch" to "steps".

@AlhitawiMohammed22 Many thanks for your reply and time. Sorry, in this way will it give me 2 models, one best model and the other the last model? Based on what metric does it give us the best model? Can we define the metric? And how can I get the train and validation loss? Many thanks.

1 Like

You can evaluate on your test_set yourself (with model.eval()),
or use your own metrics:

def compute_metrics(eval_pred):
    # compute your metrics here
    ...

and pass it to the Trainer with compute_metrics=compute_metrics.
Otherwise it will use the default one; you have to look at the Trainer repo.
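
For example, a minimal sketch of a compute_metrics for a classification task (the accuracy metric here is just an illustration, not something from the thread above):

import numpy as np

def compute_metrics(eval_pred):
    # eval_pred is an EvalPrediction: (predictions, label_ids)
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": float((preds == labels).mean())}

Pass it with compute_metrics=compute_metrics when building the Trainer; metric_for_best_model in TrainingArguments can then point at "accuracy" so the best checkpoint is chosen by that metric rather than the loss.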

@AlhitawiMohammed22 Sorry, I mean the training and validation loss during training. I want to have those numbers. Apart from TensorBoard, do you know other ways? Many thanks.

@SUNM No problem at all.
They should be in your output_dir/checkpoint-*/trainer_state.json,
if you have a checkpoint saved.
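
For instance, a minimal sketch of pulling the losses out of a checkpoint's trainer_state.json (checkpoint-500 is just a placeholder for whatever checkpoint folder you have):

import json

with open("output_dir/checkpoint-500/trainer_state.json") as f:
    state = json.load(f)

# log_history has one entry per logging/evaluation event
train_loss = [(e["step"], e["loss"]) for e in state["log_history"] if "loss" in e]
eval_loss = [(e["step"], e["eval_loss"]) for e in state["log_history"] if "eval_loss" in e]

You can then plot those two lists with any plotting library.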

1 Like

@AlhitawiMohammed22 Sorry, do you know how I can get them from the JSON file to make my graph?