Save only best model in Trainer

I have read previous posts on similar topics but could not tell whether there is a workaround to save only the best model rather than a checkpoint at every step. My disk fills up even after I set save_total_limit to 5, since the Trainer keeps saving checkpoints to disk from the start of training.
Please suggest.
Thanks

2 Likes

You can set save_strategy to "no" to avoid saving anything during training, then save the final model once training is done with trainer.save_model().
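A minimal sketch of that setup (the output directory name is a placeholder, and model, train_dataset, and val_dataset are assumed to be defined elsewhere):

```python
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="./out",
    save_strategy="no",           # no checkpoints written during training
    evaluation_strategy="steps",  # evaluation can still run on a schedule
    eval_steps=40,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)
trainer.train()
trainer.save_model("./out")       # one final save after training
```

Note that with save_strategy="no" nothing is checkpointed mid-run, so if training is interrupted you cannot resume from a checkpoint.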

2 Likes

Thank you, this is helpful.

I am using the below:

  args = TrainingArguments(
      output_dir=f"./out_fold{i}",
      overwrite_output_dir=True,
      evaluation_strategy="steps",
      eval_steps=40,
      logging_steps=40,
      learning_rate=5e-5,
      per_device_train_batch_size=8,
      per_device_eval_batch_size=8,
      num_train_epochs=10,
      seed=0,
      save_total_limit=1,
      # report_to="none",
      load_best_model_at_end=True,
      save_strategy="no",
  )
  trainer = Trainer(
      model=model,
      args=args,
      train_dataset=train_dataset,
      eval_dataset=val_dataset,
      # compute_metrics=compute_metrics,
      # callbacks=[EarlyStoppingCallback(early_stopping_pa)],
  )
  trainer.train()
  trainer.save_model(f"out_fold{i}")

Here, even though save_strategy = "no", checkpoints are still saved to disk at every evaluation (log below), which fills up the disk. Can you suggest what's going wrong?

***** Running Evaluation *****
Num examples = 567
Batch size = 8
Saving model checkpoint to ./out_fold0/checkpoint-40
Configuration saved in ./out_fold0/checkpoint-40/config.json
Model weights saved in ./out_fold0/checkpoint-40/pytorch_model.bin
Deleting older checkpoint [out_fold0/checkpoint-760] due to args.save_total_limit
***** Running Evaluation *****

1 Like

You can’t use load_best_model_at_end=True if you don’t want to save checkpoints: it needs to save checkpoints at every evaluation to make sure you have the best model, and it will always save 2 checkpoints (even if save_total_limit is 1): the best one and the last one (to resume an interrupted training).

6 Likes

If save_total_limit is set to some value, will checkpoints be replaced by newer checkpoints even if the newer checkpoints are underperforming?

The best checkpoint is always kept, as is the last checkpoint (to make sure you can resume training from it).
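For anyone curious, the retention rule described above can be sketched in plain Python. This is only an illustration of the behaviour (best and most recent checkpoints are protected, oldest others are deleted first), not the Trainer's actual rotation code:

```python
def rotate_checkpoints(checkpoints, best, limit):
    """Keep at most `limit` checkpoints, but never delete the best one or
    the most recent one. `checkpoints` is ordered oldest-to-newest.
    Returns (kept, deleted)."""
    if limit is None or len(checkpoints) <= limit:
        return list(checkpoints), []
    protected = {best, checkpoints[-1]}  # best model + last (for resuming)
    kept, deleted = [], []
    for ckpt in checkpoints:
        excess = len(checkpoints) - len(deleted) - limit
        if excess > 0 and ckpt not in protected:
            deleted.append(ckpt)  # oldest unprotected checkpoints go first
        else:
            kept.append(ckpt)
    return kept, deleted

# With save_total_limit=2 and the best model at step 80, the best and the
# latest checkpoints survive even though older ones are removed.
kept, gone = rotate_checkpoints(
    ["ckpt-40", "ckpt-80", "ckpt-120", "ckpt-160"], best="ckpt-80", limit=2)
# kept -> ["ckpt-80", "ckpt-160"], gone -> ["ckpt-40", "ckpt-120"]
```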

3 Likes

Thanks @sgugger

I believe we have to set these parameters, which will keep 2 checkpoints (the best one and the last one) and avoid saving checkpoints at every evaluation. Is that right?

save_total_limit = 2
save_strategy = "no"
load_best_model_at_end = False

@Vinayaks117, did those settings work for you to save the most recent and the best? I’d like to do the same.

@jbmaxwell Yes.

2 Likes

Great, thanks!

To be honest it didn't work for me (version 4.20.1); not sure what I'm missing.

So instead I'm running a cron job every 15 minutes to clean up those checkpoints.

The file-handle limit gets reached after about 10 hours, so as a last resort the cron job is cleaning up those files.

My bad, it does work; it was an issue with how I passed params through sys.argv.

@sgugger @prachi12 how do you know which checkpoint had the best performance and how do you load that specific checkpoint?

2 Likes

@artificial-cerebral I had the same question and couldn't find the answer in the documentation. After checking the source code, I found:

# Define your trainer, etc...
trainer.train()

# After training, access the path of the best checkpoint like this
best_ckpt_path = trainer.state.best_model_checkpoint
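To then load that checkpoint explicitly, a sketch (AutoModelForSequenceClassification is an assumption here; swap in whichever Auto class matches your task):

```python
from transformers import AutoModelForSequenceClassification

best_ckpt_path = trainer.state.best_model_checkpoint
model = AutoModelForSequenceClassification.from_pretrained(best_ckpt_path)
```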

2 Likes

@astariul - Thanks! That seems to be working. For future readers, another option would be to remove the whole directory (rm -r $save_path) after training and then call trainer.save_model(). That way only the best model is saved (assuming you had enabled load_best_model_at_end=True). Alternatively, if you don't want to delete the checkpoints, skip the rm -r $save_path and pass a new path to trainer.save_model(output_dir=new_path).
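A self-contained sketch of that cleanup step in Python (the directory layout is simulated for the example; in practice save_path would be your output_dir, and you would call trainer.save_model(save_path) afterwards):

```python
import os
import shutil
import tempfile

# Simulate an output_dir left behind by training: a few checkpoint folders.
save_path = tempfile.mkdtemp()
for step in (40, 80, 120):
    os.makedirs(os.path.join(save_path, f"checkpoint-{step}"))

# Remove only the checkpoint-* subdirectories, keeping the directory itself
# so a final save (e.g. trainer.save_model(save_path)) can write into it.
for name in os.listdir(save_path):
    if name.startswith("checkpoint-"):
        shutil.rmtree(os.path.join(save_path, name))

remaining = os.listdir(save_path)  # -> []: no checkpoints left
```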

Hey @artificial-cerebral, can you please share a code example of how you do that?