Is there a parameter in the config that allows us to save only the best-performing checkpoint?
Currently, multiple checkpoints are saved based on save_steps (as well as batch_size and dataset size). If we want to train the model for, let's say, 10 epochs and the 7th epoch gives the best performance on the validation set, how can we save only the checkpoint from the 7th epoch and ignore the rest?
Is this implemented yet? If not, is there any other way to do this manually (read: hackishly) with the Trainer API? (Except, of course, saving the model at every step.)
Hi Tanuj,
I have not seen any parameter for that. However, there is a workaround.
Use the following combination:
evaluation_strategy = "steps",
eval_steps = 10,          # Evaluation and Save happen every 10 steps
save_total_limit = 5,     # Only the last 5 models are saved. Older ones are deleted.
load_best_model_at_end = True,
When I tried the above combination, at any time the 5 most recent models are kept in the output directory, but if the best model is not one of them, it keeps the best model as well. So it will be 1 + 5 models. You can set save_total_limit = 1 so it will serve your purpose.
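Putting it together, a minimal sketch of that combination might look like this (the model and dataset names and the output path are placeholders, not from this thread; save_steps is set explicitly so saving lines up with evaluation):

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="output",            # placeholder path
    evaluation_strategy="steps",
    eval_steps=10,                  # evaluate every 10 steps
    save_steps=10,                  # save on the same schedule as evaluation
    save_total_limit=5,             # keep only the last 5 checkpoints (plus the best one)
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,                    # assumed to be defined elsewhere
    args=training_args,
    train_dataset=train_dataset,    # assumed to be defined elsewhere
    eval_dataset=val_dataset,       # assumed to be defined elsewhere
)
trainer.train()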
@sgugger So if I understand this correctly, in this post Checkpoints and disk storage - #9 by sgugger you actually "confirm" that the proposed configuration from @karthikcs will achieve the asked-for functionality (from the original question), namely that setting
save_total_limit=1
will save and load ONLY the best-performing checkpoint (save because the best one "never gets deleted", load because there is only one checkpoint left). Is this correct?
Can confirm that setting save_total_limit to whatever you want, even 1, will not interfere with Trainer's ability to load the best model at the end. Look at the source:
First, _sorted_checkpoints prioritizes which checkpoints to keep while making sure not to delete the best model, by moving the best model toward the end of the sorted list (checkpoints to delete are taken from the front):
checkpoints_sorted = [checkpoint[1] for checkpoint in checkpoints_sorted]
# Make sure we don't delete the best model.
if self.state.best_model_checkpoint is not None:
    best_model_index = checkpoints_sorted.index(str(Path(self.state.best_model_checkpoint)))
    for i in range(best_model_index, len(checkpoints_sorted) - 2):
        checkpoints_sorted[i], checkpoints_sorted[i + 1] = checkpoints_sorted[i + 1], checkpoints_sorted[i]
return checkpoints_sorted
Finally, after sorting, if save_total_limit=1, this number is actually increased to 2, so that you always keep the best model:
# If save_total_limit=1 with load_best_model_at_end=True, we could end up deleting the last checkpoint, which
# we don't do to allow resuming.
save_total_limit = self.args.save_total_limit
if (
    self.state.best_model_checkpoint is not None
    and self.args.save_total_limit == 1
    and checkpoints_sorted[-1] != self.state.best_model_checkpoint
):
    save_total_limit = 2
What I am looking to do is save checkpoints frequently so as not to waste compute time if I need to restart (so I set save_total_limit = 5 and save often). But I also want to keep bigger steps, like epochs, for future analysis of training progress and so on. I definitely need to delete the smaller steps because otherwise I quickly run out of space. Is there an option to do this? Thanks!
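As far as I know there is no built-in option for exactly that, but one possible workaround is a small custom callback that copies the checkpoint saved at each epoch boundary into a separate folder before rotation deletes it. A rough sketch (the callback class and archive directory names are made up for illustration):

import shutil
from pathlib import Path
from transformers import TrainerCallback

class KeepEpochCheckpointsCallback(TrainerCallback):
    # Hypothetical helper, not part of the Trainer API: archives epoch-boundary
    # checkpoints so they survive save_total_limit rotation.
    def __init__(self, archive_dir):
        self.archive_dir = Path(archive_dir)
        self.archive_dir.mkdir(parents=True, exist_ok=True)

    def on_save(self, args, state, control, **kwargs):
        # Only archive saves that fall (roughly) on an epoch boundary.
        if state.epoch is not None and abs(state.epoch - round(state.epoch)) < 1e-6:
            src = Path(args.output_dir) / f"checkpoint-{state.global_step}"
            dst = self.archive_dir / src.name
            if src.exists() and not dst.exists():
                shutil.copytree(src, dst)

trainer.add_callback(KeepEpochCheckpointsCallback("epoch_checkpoints"))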
Hi @AlhitawiMohammed22, I hope you are well. Sorry, during training I can see the saved checkpoints, but when the training is finished no checkpoints are saved for testing; all checkpoints disappear from the folder. Would you please tell me how I can save the best model? My code is as follows; what did I miss in the code?
training_args = TrainingArguments(
    output_dir=Results_Path, learning_rate=5e-5, num_train_epochs=10,
    evaluation_strategy="epoch", logging_strategy="epoch", save_strategy="epoch",
    seed=42, load_best_model_at_end=True, report_to="tensorboard",
    per_device_train_batch_size=2, per_device_eval_batch_size=2, save_total_limit=1,
    warmup_steps=100, weight_decay=0.01, logging_dir=Results_Path)

Trainer(model=model, args=training_args, tokenizer=tokenizer, train_dataset=train_dataset,
        eval_dataset=val_dataset,
        data_collator=lambda data: {'input_ids': torch.stack([f[0] for f in data]),
                                    'attention_mask': torch.stack([f[1] for f in data]),
                                    'labels': torch.stack([f[0] for f in data])}).train()
@AlhitawiMohammed22 Many thanks for your reply and time. Sorry, in this way will it give me 2 models, one best model and the other the last model? Based on what metric does it pick the best model? Can we define the metric? And how can I get the train and validation loss? Many thanks.
You can call model.eval() and run the model on your test_set yourself, or alternatively define your own metrics:
def compute_metrics(eval_pred):
    # compute your metrics here
    pass
Then pass it to the Trainer:
` compute_metrics=compute_metrics,`
Otherwise, it will use the default behaviour; have a look at the Trainer `repo` for details.
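For example, a minimal compute_metrics sketch (accuracy here is just for illustration, not from this thread) and how it plugs into the Trainer:

import numpy as np

def compute_metrics(eval_pred):
    # eval_pred is a (predictions, label_ids) pair produced during evaluation
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": (preds == labels).mean()}

trainer = Trainer(
    model=model,                    # assumed to be defined elsewhere
    args=training_args,
    train_dataset=train_dataset,    # assumed to be defined elsewhere
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
)

By default, load_best_model_at_end picks the checkpoint with the lowest evaluation loss; if you set metric_for_best_model="accuracy" (and greater_is_better=True) in TrainingArguments, it will use your metric instead.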
@AlhitawiMohammed22 Sorry, I mean the training and validation loss during training. I want to have those numbers. Apart from TensorBoard, do you know other ways? Many thanks!
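(One way to get those numbers without TensorBoard, as a sketch: the Trainer keeps everything it logs in trainer.state.log_history, so after training you can filter it yourself.)

# training entries contain a "loss" key, evaluation entries an "eval_loss" key
train_logs = [e for e in trainer.state.log_history if "loss" in e]
eval_logs = [e for e in trainer.state.log_history if "eval_loss" in e]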