Saving only the best performing checkpoint

Hi,

Is there a parameter in the config that allows us to save only the best performing checkpoint?
Currently, multiple checkpoints are saved based on save_steps (and the batch size and dataset size). If we want to train the model for, let's say, 10 epochs and the 7th epoch gives the best performance on the validation set, how can we save only the checkpoint from the 7th epoch and ignore the rest?

Thanks.

4 Likes

There is no parameter for that yet; we are keeping this in mind for future development of the Trainer.

4 Likes

Is this implemented yet? If not, is there any other way to do this manually (read: hackishly) with the Trainer API? (Except, of course, saving the model every step.)

Hi Tanuj,
I have not seen any parameter for that. However, there is a workaround.

Use the following combination:

evaluation_strategy = "steps",
eval_steps = 10,  # evaluation and save happen every 10 steps
save_total_limit = 5,  # only the last 5 models are saved; older ones are deleted
load_best_model_at_end = True,

When I tried this combination, at any time the 5 most recent checkpoints are kept in the output directory, but if the best model is not among them it is kept as well, so you end up with 1 + 5 models. You can change save_total_limit = 1 so it serves your purpose.
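
For reference, a minimal sketch of how that combination might look in TrainingArguments (output_dir is a placeholder, and the explicit save_steps / metric_for_best_model / greater_is_better lines are my additions, not part of the suggestion above):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",                   # placeholder
    evaluation_strategy="steps",
    eval_steps=10,                      # evaluate every 10 steps
    save_steps=10,                      # keep saving in sync with evaluation
    save_total_limit=1,                 # keep only the most recent checkpoint (plus the best one)
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",  # assumption: rank checkpoints by eval loss
    greater_is_better=False,
)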

5 Likes

Another option is to use a Keras callback.

@sgugger So if I understand this correctly, in this post Checkpoints and disk storage - #9 by sgugger you actually “confirm” that the configuration proposed by @karthikcs will achieve the functionality asked for in the original question, namely that having

save_total_limit=1

will actually save and load ONLY the best performing checkpoint (save, because the best one is always rotated to safety and “never gets deleted”; load, because there is only one checkpoint left). Is this correct?

8 Likes

This is a great question. Was it ever clarified?

1 Like

Can confirm that setting save_total_limit to whatever you want, even 1, will not interfere with the Trainer's ability to load the best model at the end. Look at the source:

First, _sorted_checkpoints decides which checkpoints to keep and makes sure not to delete the best model, by rotating the best model toward the end of the sorted list (checkpoints are deleted from the front of this list):

        checkpoints_sorted = [checkpoint[1] for checkpoint in checkpoints_sorted]
        # Make sure we don't delete the best model.
        if self.state.best_model_checkpoint is not None:
            best_model_index = checkpoints_sorted.index(str(Path(self.state.best_model_checkpoint)))
            for i in range(best_model_index, len(checkpoints_sorted) - 2):
                checkpoints_sorted[i], checkpoints_sorted[i + 1] = checkpoints_sorted[i + 1], checkpoints_sorted[i]
        return checkpoints_sorted
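
To see what that rotation does, here is a small standalone snippet (not library code) that runs the same loop on a made-up list of checkpoint names:

checkpoints_sorted = ["checkpoint-100", "checkpoint-200", "checkpoint-300", "checkpoint-400"]
best_model_checkpoint = "checkpoint-200"

best_model_index = checkpoints_sorted.index(best_model_checkpoint)
for i in range(best_model_index, len(checkpoints_sorted) - 2):
    checkpoints_sorted[i], checkpoints_sorted[i + 1] = checkpoints_sorted[i + 1], checkpoints_sorted[i]

print(checkpoints_sorted)
# ['checkpoint-100', 'checkpoint-300', 'checkpoint-200', 'checkpoint-400']
# Checkpoints to delete are taken from the front of this list, so the best one survives.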

Then, in _rotate_checkpoints, if save_total_limit=1 and the best model is not the most recent checkpoint, the limit is bumped up to 2, so that the latest checkpoint (needed for resuming) is kept alongside the best model:

        # If save_total_limit=1 with load_best_model_at_end=True, we could end up deleting the last checkpoint, which
        # we don't do to allow resuming.
        save_total_limit = self.args.save_total_limit
        if (
            self.state.best_model_checkpoint is not None
            and self.args.save_total_limit == 1
            and checkpoints_sorted[-1] != self.state.best_model_checkpoint
        ):
            save_total_limit = 2
1 Like

What I am looking to do is save checkpoints frequently so as not to waste compute time if I need to restart (so I set save_total_limit = 5 and save often). But I also want to keep checkpoints at bigger intervals, like every epoch, for future analysis of training success and so on. I definitely need to delete the smaller steps, because otherwise I quickly use up the disk space. Is there an option to do this? Thanks!

It may depend on the size of your save_steps. In my case, I use:

save_steps = len(train_data['train']) // batch_size  # after each epoch
warmup_steps = save_steps // 10  # 10% of save_steps
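
For context, a sketch of how that might slot into TrainingArguments (train_data, batch_size and output_dir are placeholders, and this assumes a single device so that save_steps really lands on an epoch boundary):

from transformers import TrainingArguments

batch_size = 8
save_steps = len(train_data['train']) // batch_size  # roughly one checkpoint per epoch

training_args = TrainingArguments(
    output_dir="out",               # placeholder
    per_device_train_batch_size=batch_size,
    save_steps=save_steps,
    warmup_steps=save_steps // 10,  # 10% of an epoch, as above
    save_total_limit=5,             # only the 5 most recent checkpoints are kept
)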

Do these parameter settings save the best checkpoint after training finishes?
If we set:

save_total_limit = 1, 
save_best_checkpoint = True 

or these settings:

        predict_with_generate=True,
        evaluation_strategy="steps",
        per_device_train_batch_size= 4,
        per_device_eval_batch_size= 4, 
        num_train_epochs= 10 ,
        fp16= True,
        learning_rate= 4e-5, 
        output_dir=f'models/', 
        logging_steps=10,
        save_steps=500,  
        eval_steps=500,  
        save_total_limit = 1,
        report_to=["tensorboard"],

If neither of the two above works, what is the correct parameter setting I should use to keep the best checkpoint at the end?

Do those settings save the best checkpoint at the end if I set save_total_limit=1 with save_best_checkpoint?

Hi @Thang, I hope you are well. Sorry, during training I can see the saved checkpoints, but when the training is finished no checkpoint is saved for testing; all checkpoints disappear from the folder. Would you please tell me how I can save the best model? My code is as follows. What am I missing?


training_args = TrainingArguments(
    output_dir=Results_Path, learning_rate=5e-5, num_train_epochs=10,
    evaluation_strategy="epoch", logging_strategy="epoch", save_strategy="epoch",
    seed=42, load_best_model_at_end=True, report_to="tensorboard",
    per_device_train_batch_size=2, per_device_eval_batch_size=2,
    save_total_limit=1, warmup_steps=100, weight_decay=0.01, logging_dir=Results_Path)

Trainer(model=model, args=training_args, tokenizer=tokenizer,
        train_dataset=train_dataset, eval_dataset=val_dataset,
        data_collator=lambda data: {'input_ids': torch.stack([f[0] for f in data]),
                                    'attention_mask': torch.stack([f[1] for f in data]),
                                    'labels': torch.stack([f[0] for f in data])}).train()

Hi @AlhitawiMohammed22, I hope you are well. Sorry, during training I can see the saved checkpoints, but when the training is finished no checkpoint is saved for testing; all checkpoints disappear from the folder. Would you please tell me how I can save the best model? My code is as follows. What am I missing?


training_args = TrainingArguments(
    output_dir=Results_Path, learning_rate=5e-5, num_train_epochs=10,
    evaluation_strategy="epoch", logging_strategy="epoch", save_strategy="epoch",
    seed=42, load_best_model_at_end=True, report_to="tensorboard",
    per_device_train_batch_size=2, per_device_eval_batch_size=2,
    save_total_limit=1, warmup_steps=100, weight_decay=0.01, logging_dir=Results_Path)

Trainer(model=model, args=training_args, tokenizer=tokenizer,
        train_dataset=train_dataset, eval_dataset=val_dataset,
        data_collator=lambda data: {'input_ids': torch.stack([f[0] for f in data]),
                                    'attention_mask': torch.stack([f[1] for f in data]),
                                    'labels': torch.stack([f[0] for f in data])}).train()
1 Like

It works when I pass a combination of three parameters:

  evaluation_strategy="steps",
  save_total_limit = 1,
  load_best_model_at_end =True,

In addition (optionally), you can save the model at the end:

trainer.save_model(output_dir="path/to/out")

So in your settings you just need to change the strategy from "epoch" to "steps".

@AlhitawiMohammed22 Many thanks for your reply and time. Sorry, in this way will it give me 2 models, one best model and the other the last model? Based on what metric does it give us the best model? Can we define the metric? And how can I get the train and validation loss? Many thanks.

1 Like

You can evaluate on your test_set yourself (with model.eval()),
or use your own metrics:

def compute_metrics(eval_pred):
    # compute your metrics here
    ...

and pass it to the Trainer with compute_metrics=compute_metrics.
Otherwise it will use the default one; you have to look at the Trainer repo.
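
For example, a minimal sketch of a compute_metrics for a classification task (the accuracy metric here is just an illustration, not something from the thread above):

import numpy as np

def compute_metrics(eval_pred):
    # eval_pred is an EvalPrediction: (predictions, label_ids)
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": float((preds == labels).mean())}

Pass it with compute_metrics=compute_metrics when building the Trainer; metric_for_best_model in TrainingArguments can then point at "accuracy" so the best checkpoint is chosen by that metric rather than the loss.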

@AlhitawiMohammed22 Sorry, I mean the training and validation loss during training. I want to have those numbers. Apart from TensorBoard, do you know other ways? Many thanks.

@SUNM No problem at all.
They should be in your output_dir/checkpoint-*/trainer_state.json,
if you have a checkpoint saved.
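
For instance, a minimal sketch of pulling the losses out of a checkpoint's trainer_state.json (checkpoint-500 is just a placeholder for whatever checkpoint folder you have):

import json

with open("output_dir/checkpoint-500/trainer_state.json") as f:
    state = json.load(f)

# log_history has one entry per logging/evaluation event
train_loss = [(e["step"], e["loss"]) for e in state["log_history"] if "loss" in e]
eval_loss = [(e["step"], e["eval_loss"]) for e in state["log_history"] if "eval_loss" in e]

You can then plot those two lists with any plotting library.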

1 Like

@AlhitawiMohammed22 Sorry, do you know how I can get them from the JSON file to make my graph?