Checkpoints and disk storage

I am fine-tuning on a custom dataset using this guide: https://huggingface.co/transformers/master/custom_datasets.html
It is saving too many checkpoints, which fills up my disk. Any suggestions?
Thanks

3 Likes

If it’s the part using Trainer, you can use the argument save_total_limit (in TrainingArguments) to limit the number of checkpoints kept.
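For example, a minimal sketch (the output_dir value here is only a placeholder):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",   # placeholder path
    save_strategy="epoch",    # write a checkpoint at the end of every epoch
    save_total_limit=3,       # keep at most 3 checkpoint-* folders; older ones are deleted
)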

3 Likes

Thanks, that is indeed very useful for me.
One more question for clarification: once checkpoints are saved, should I use the last one when loading the model later, or is my understanding incorrect?
Also, how many checkpoints should we keep? Can we keep just one?

There is a load_best_model_at_end argument too, that will automatically load your best model (according to a metric you choose) at the end of training. It’s all in the docs :wink:
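A rough sketch of the relevant arguments (the metric name here is only an illustration):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",              # placeholder path
    evaluation_strategy="epoch",         # evaluation and saving need to happen at the same points
    save_strategy="epoch",
    load_best_model_at_end=True,         # reload the best checkpoint once training finishes
    metric_for_best_model="eval_loss",   # metric that decides which checkpoint is "best"
    greater_is_better=False,             # lower eval_loss is better
)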

3 Likes

@sgugger How “smart” is this feature? I remember that in OpenNMT, when you specify a maximum number of checkpoints, it does not take the best evaluated checkpoint up to that point into account. In other words, if an old checkpoint is the best, it can still get deleted. That should not be the case, so I am wondering whether the implementation in the transformers Trainer is a bit smarter?

Thanks for the information, I am looking into the docs.

Thanks, it solved my issue, especially load_best_model_at_end.

However, for load_best_model_at_end, does it refer to the models saved as checkpoints? For example, I am using DistilBERT; does the best model here refer to one of those checkpoints?

It is smarter: the best checkpoint is always put at the top of the list of checkpoints, so it never gets deleted (or if it does, it’s a bug :wink: ).

5 Likes

Love it. Thanks!

1 Like

Hi @BramVanroy, I hope you are well. Sorry, during training I can see the saved checkpoints, but when the training is finished no checkpoint is left for testing; all the checkpoints disappear from the folder. Would you please tell me how I can save the best model? My code is as follows. What did I miss in the code?


training_args = TrainingArguments(
    output_dir=Results_Path, learning_rate=5e-5, num_train_epochs=10, seed=42,
    evaluation_strategy="epoch", logging_strategy="epoch", save_strategy="epoch",
    load_best_model_at_end=True, save_total_limit=1, report_to="tensorboard",
    per_device_train_batch_size=2, per_device_eval_batch_size=2,
    warmup_steps=100, weight_decay=0.01, logging_dir=Results_Path,
)

Trainer(
    model=model, args=training_args, tokenizer=tokenizer,
    train_dataset=train_dataset, eval_dataset=val_dataset,
    # labels reuse input_ids (f[0]), as in causal language-model fine-tuning
    data_collator=lambda data: {"input_ids": torch.stack([f[0] for f in data]),
                                "attention_mask": torch.stack([f[1] for f in data]),
                                "labels": torch.stack([f[0] for f in data])},
).train()

save_total_limit=1 will always only keep one checkpoint. Remove this argument and try again.

1 Like

@BramVanroy many thanks for your reply. Sorry, I need the training and validation loss to make a graph. Do you know how I can get the values from trainer.state.log_history and save them? I really appreciate your big help.
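For reference, this is roughly what I mean (a sketch, assuming the Trainer from above is kept in a variable, e.g. trainer = Trainer(...), and that log_history holds one dict per logging/eval event):

import json

history = trainer.state.log_history   # training entries carry "loss", eval entries carry "eval_loss"
train_loss = [(h["epoch"], h["loss"]) for h in history if "loss" in h]
val_loss = [(h["epoch"], h["eval_loss"]) for h in history if "eval_loss" in h]

with open("loss_history.json", "w") as f:
    json.dump({"train": train_loss, "validation": val_loss}, f)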

I use:

save_total_limit=1
evaluation_strategy="steps"
load_best_model_at_end=True

In my very limited experience (I only started with this a couple of days ago), I believe that save_total_limit=n keeps n+1 models, as the best model is always kept and always listed first. So, if you set it to 1, at any given time you will have the working model and the best model.

@sgugger @BramVanroy
transformers.TrainingArguments(
    per_device_train_batch_size=8,
    gradient_accumulation_steps=16,
    warmup_steps=100,
    num_train_epochs=2,
    learning_rate=2e-5,
    fp16=True,
    logging_steps=1,
    output_dir="lora-alpaca",
    save_total_limit=3,
)
I’m able to generate a checkpoint-500 directory with the following structure:
i. optimizer.pt
ii. pytorch_model.bin
iii. rng_state.pth
iv. scaler.pt
v. scheduler.pt
vi. trainer_state.json
vii. training_args.bin

But the real problem starts here: when I run the same code in a different notebook file, I can’t see any checkpoint-500 folder. What is the issue and how can I resolve it?

Please help me out!