I am fine-tuning on a custom dataset using this link: https://huggingface.co/transformers/master/custom_datasets.html
It is saving too many checkpoints, which fills up my disk. Any suggestions?
Thanks
If it's the part using Trainer, you can use the argument save_total_limit (in TrainingArguments) to limit the number of checkpoints kept.
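For example, something along these lines (just a minimal sketch; the output_dir and save strategy are placeholders):

from transformers import TrainingArguments

# Keep at most 2 checkpoints on disk; older ones are deleted as new ones are saved.
training_args = TrainingArguments(
    output_dir="my_model",      # placeholder path
    save_strategy="epoch",
    save_total_limit=2,         # limits how many checkpoints are kept
)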
Thanks, it's indeed very useful for me.
One more question for my clarification: once it is saved, should I use the last checkpoint when loading the model later, or is my understanding incorrect?
Also, I want to know how many checkpoints we should keep. Can we keep just one checkpoint?
There is a load_best_model_at_end argument too, which will automatically load your best model (according to a metric you choose) at the end of training. It's all in the docs.
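Roughly like this (a sketch; "eval_loss" is just an example metric, use whichever one you track):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="my_model",              # placeholder path
    evaluation_strategy="epoch",        # evaluate every epoch
    save_strategy="epoch",              # must match the evaluation strategy
    save_total_limit=2,
    load_best_model_at_end=True,        # reload the best checkpoint when training ends
    metric_for_best_model="eval_loss",  # example metric
    greater_is_better=False,            # lower loss is better
)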
@sgugger How "smart" is this feature? I remember that in OpenNMT, when you specify a maximum number of checkpoints, this does not take into account the best evaluated checkpoint up to that point. In other words, if there is an old checkpoint that is the best, it can still get deleted. That should not be the case, so I am wondering whether the implementation in the transformers Trainer is a bit smarter?
Thanks for the information, I am looking into the docs.
Thanks, it solved my issue, especially working with load_best_model_at_end.
However, for load_best_model_at_end, does this refer to the models saved as checkpoints? For example, I am using DistilBERT, so does the best model here refer to one of its checkpoints?
It is smarter: the best checkpoint is always put at the top of the list of checkpoints, so it never gets deleted (or if it does, it's a bug).
Love it. Thanks!
Hi @BramVanroy, I hope you are well. Sorry, during training I can see the saved checkpoints, but when the training is finished no checkpoint is left for testing; all checkpoints disappear from the folder. Would you please tell me how I can save the best model? My code is as follows, what am I missing?
training_args = TrainingArguments(
    output_dir=Results_Path, learning_rate=5e-5, num_train_epochs=10,
    evaluation_strategy="epoch", logging_strategy="epoch", save_strategy="epoch",
    seed=42, load_best_model_at_end=True, report_to="tensorboard",
    per_device_train_batch_size=2, per_device_eval_batch_size=2,
    save_total_limit=1, warmup_steps=100, weight_decay=0.01, logging_dir=Results_Path,
)

Trainer(
    model=model, args=training_args, tokenizer=tokenizer,
    train_dataset=train_dataset, eval_dataset=val_dataset,
    data_collator=lambda data: {
        'input_ids': torch.stack([f[0] for f in data]),
        'attention_mask': torch.stack([f[1] for f in data]),
        'labels': torch.stack([f[0] for f in data]),
    },
).train()
save_total_limit=1 will always only keep one checkpoint. Remove this argument and try again.
@BramVanroy many thanks for your reply. Sorry, I need the training and validation loss to make a graph. Do you know how I can get the values from trainer.state.log_history and save them? I really appreciate your big help.
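From what I understand, trainer.state.log_history is a list of dicts, so I was thinking of something like this (a rough sketch, not sure if it is the right way; the exact keys depend on the logging and evaluation settings):

import json

# Training logs contain "loss", evaluation logs contain "eval_loss".
history = trainer.state.log_history

train_loss = [(h["epoch"], h["loss"]) for h in history if "loss" in h]
eval_loss = [(h["epoch"], h["eval_loss"]) for h in history if "eval_loss" in h]

# Save the values so they can be plotted later
with open("loss_history.json", "w") as f:
    json.dump({"train": train_loss, "eval": eval_loss}, f)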
I use:
save_total_limit=1
evaluation_strategy="steps"
load_best_model_at_end=True
In my very limited experience, as I started in this business a couple of days ago, I believe that save_total_limit=n keeps n+1 models, since the best model is always kept and always listed first. So, if you set it to 1, at any given time you will have the working model and the best model.
@sgugger @BramVanroy
transformers.TrainingArguments(
    per_device_train_batch_size=8,
    gradient_accumulation_steps=16,
    warmup_steps=100,
    num_train_epochs=2,
    learning_rate=2e-5,
    fp16=True,
    logging_steps=1,
    output_dir="lora-alpaca",
    save_total_limit=3,
)
I'm able to generate a checkpoint-500 directory with the following structure:
i. optimizer.pt
ii. pytorch_model.bin
iii. rng_state.pth
iv. scaler.pt
v. scheduler.pt
vi. trainer_state.json
vii. training_args.json
But the real problem starts here:
When I run the same code in a different notebook file, I can't see any checkpoint-500 folder.
What's the issue and how can I resolve it?
Please help me out!
Is it possible to save the outputs to some other volume? For example:
training_args = TrainingArguments(output_dir="<path-to-external-volume>
Is that possible?