Checkpoints and disk storage

I am fine-tuning on a custom dataset using this guide: https://huggingface.co/transformers/master/custom_datasets.html
It is saving too many checkpoints, which fills up my disk. Any suggestions?
Thanks

3 Likes

If it’s the part using Trainer, you can use the argument save_total_limit (in TrainingArguments) to limit the number of checkpoints kept.
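For example, a minimal sketch (the output_dir value here is only a placeholder):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",   # placeholder path
    save_strategy="epoch",    # write a checkpoint at the end of every epoch
    save_total_limit=3,       # keep at most 3 checkpoint-* folders; older ones are deleted
)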

3 Likes

Thanks, that is indeed very useful for me.
One more question for clarification: once checkpoints are saved, should I use the last one when loading the model later, or is my understanding incorrect?
Also, how many checkpoints should we keep? Can we keep just one?

There is a load_best_model_at_end argument too, that will automatically load your best model (according to a metric you choose) at the end of training. It’s all in the docs :wink:
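A rough sketch of the relevant arguments (the metric name here is only an illustration):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",              # placeholder path
    evaluation_strategy="epoch",         # evaluation and saving need to happen at the same points
    save_strategy="epoch",
    load_best_model_at_end=True,         # reload the best checkpoint once training finishes
    metric_for_best_model="eval_loss",   # metric that decides which checkpoint is "best"
    greater_is_better=False,             # lower eval_loss is better
)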

3 Likes

@sgugger How “smart” is this feature? I remember that in OpenNMT, when you specify a maximum number of checkpoints, it does not take the best evaluated checkpoint up to that point into account. In other words, if an old checkpoint is the best, it can still get deleted. That should not be the case, so I am wondering whether the implementation in the transformers Trainer is a bit smarter?

Thanks for the information, I am looking into the docs.

Thanks, it solved my issue, especially load_best_model_at_end.

However, for load_best_model_at_end, does it refer to the models saved as checkpoints? For example, I am using DistilBERT; does the best model here refer to one of those checkpoints?

It is smarter: the best checkpoint is always put at the top of the list of checkpoints, so it never gets deleted (or if it does, it’s a bug :wink: ).

5 Likes

Love it. Thanks!

1 Like

Hi @BramVanroy, I hope you are well. Sorry, during training I can see the saved checkpoints, but when the training is finished no checkpoint is left for testing; all the checkpoints disappear from the folder. Would you please tell me how I can save the best model? My code is as follows. What did I miss in the code?


training_args = TrainingArguments(
    output_dir=Results_Path, learning_rate=5e-5, num_train_epochs=10, seed=42,
    evaluation_strategy="epoch", logging_strategy="epoch", save_strategy="epoch",
    load_best_model_at_end=True, save_total_limit=1, report_to="tensorboard",
    per_device_train_batch_size=2, per_device_eval_batch_size=2,
    warmup_steps=100, weight_decay=0.01, logging_dir=Results_Path,
)

Trainer(
    model=model, args=training_args, tokenizer=tokenizer,
    train_dataset=train_dataset, eval_dataset=val_dataset,
    # labels reuse input_ids (f[0]), as in causal language-model fine-tuning
    data_collator=lambda data: {"input_ids": torch.stack([f[0] for f in data]),
                                "attention_mask": torch.stack([f[1] for f in data]),
                                "labels": torch.stack([f[0] for f in data])},
).train()

save_total_limit=1 will always only keep one checkpoint. Remove this argument and try again.

1 Like

@BramVanroy many thanks for your reply. Sorry, I need the training and validation loss to make a graph. Do you know how I can get the values from trainer.state.log_history and save them? I really appreciate your big help.
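For reference, this is roughly what I mean (a sketch, assuming the Trainer from above is kept in a variable, e.g. trainer = Trainer(...), and that log_history holds one dict per logging/eval event):

import json

history = trainer.state.log_history   # training entries carry "loss", eval entries carry "eval_loss"
train_loss = [(h["epoch"], h["loss"]) for h in history if "loss" in h]
val_loss = [(h["epoch"], h["eval_loss"]) for h in history if "eval_loss" in h]

with open("loss_history.json", "w") as f:
    json.dump({"train": train_loss, "validation": val_loss}, f)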

I use:

save_total_limit=1
evaluation_strategy="steps"
load_best_model_at_end=True

In my very limited experience (I only started with this a couple of days ago), I believe that save_total_limit=n keeps n+1 models, as the best model is always kept and always listed first. So, if you set it to 1, at any given time you will have the working model and the best model.

@sgugger @BramVanroy
transformers.TrainingArguments(
    per_device_train_batch_size=8,
    gradient_accumulation_steps=16,
    warmup_steps=100,
    num_train_epochs=2,
    learning_rate=2e-5,
    fp16=True,
    logging_steps=1,
    output_dir="lora-alpaca",
    save_total_limit=3,
)
I’m able to generate a checkpoint-500 directory with the following structure:
i. optimizer.pt
ii. pytorch_model.bin
iii. rng_state.pth
iv. scaler.pt
v. scheduler.pt
vi. trainer_state.json
vii. training_args.bin

But the real problem starts here: when I run the same code in a different notebook file, I can’t see any checkpoint-500 folder. What is the issue and how can I resolve it?

Please help me out!