Save only best model in Trainer

I have read previous posts on this topic but could not conclude whether there is a workaround to save only the best model rather than a checkpoint at every step. My disk space fills up even after I set save_total_limit to 5, since the Trainer still writes every checkpoint to disk first.
Please suggest.
Thanks

1 Like

You can set save_strategy to "no" to avoid saving anything during training, and then save the final model once training is done with trainer.save_model().
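For reference, a minimal sketch of that approach (model, train_dataset and val_dataset are placeholders for your own objects):

    from transformers import Trainer, TrainingArguments

    args = TrainingArguments(
        output_dir="./out",
        num_train_epochs=10,
        save_strategy="no",   # nothing is written to disk during training
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
    )
    trainer.train()
    trainer.save_model("./out")  # write the final model once, after training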

1 Like

Thank you, this is helpful.

I am using the code below:

  from transformers import Trainer, TrainingArguments

  args = TrainingArguments(
      output_dir=f"./out_fold{i}",
      overwrite_output_dir=True,
      evaluation_strategy="steps",
      eval_steps=40,
      logging_steps=40,
      learning_rate=5e-5,
      per_device_train_batch_size=8,
      per_device_eval_batch_size=8,
      num_train_epochs=10,
      seed=0,
      save_total_limit=1,
      # report_to="none",
      load_best_model_at_end=True,
      save_strategy="no",
  )
  trainer = Trainer(
      model=model,
      args=args,
      train_dataset=train_dataset,
      eval_dataset=val_dataset,
      # compute_metrics=compute_metrics,
      # callbacks=[EarlyStoppingCallback(early_stopping_patience=...)],
  )
  trainer.train()
  trainer.save_model(f"out_fold{i}")

Here, even though save_strategy="no", checkpoints are still saved to disk (as below), which is what fills it up. Can you suggest what's going wrong?

***** Running Evaluation *****
Num examples = 567
Batch size = 8
Saving model checkpoint to ./out_fold0/checkpoint-40
Configuration saved in ./out_fold0/checkpoint-40/config.json
Model weights saved in ./out_fold0/checkpoint-40/pytorch_model.bin
Deleting older checkpoint [out_fold0/checkpoint-760] due to args.save_total_limit
***** Running Evaluation *****

You can’t use load_best_model_at_end=True if you don’t want to save checkpoints: it needs to save a checkpoint at every evaluation to make sure you have the best model, and it will always keep 2 checkpoints (even if save_total_limit is 1): the best one and the last one (to resume an interrupted training run).
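For anyone who wants that behaviour instead, here is a minimal sketch of arguments that keep checkpointing and load_best_model_at_end consistent (the step interval and metric name are assumptions, adjust them to your run):

    from transformers import TrainingArguments

    args = TrainingArguments(
        output_dir="./out",
        evaluation_strategy="steps",
        eval_steps=40,
        save_strategy="steps",              # must line up with the evaluation strategy
        save_steps=40,
        save_total_limit=2,                 # keeps the best and the most recent checkpoint
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",  # assumed metric; lower is better for losses
    )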

4 Likes

If save_total_limit is set to some value, will older checkpoints be replaced by newer ones even if the newer ones are underperforming?

The best checkpoint is always kept, as is the last checkpoint (to make sure you can resume training from it).

Thanks @sgugger

I believe we have to set the parameters below to save 2 checkpoints (the best one and the last one) and avoid saving a checkpoint at every evaluation. Is that right?

save_total_limit = 2
save_strategy = "no"
load_best_model_at_end = False

@Vinayaks117, did those settings work for you to save the most recent and the best? I’d like to do the same.

@jbmaxwell Yes.

2 Likes

Great, thanks!