Hi @sgugger,
My training got interrupted partway through its epochs, so I used resume_from_checkpoint to continue it. I was wondering: when I resume from a checkpoint and use load_best_model_at_end, does the last checkpoint have “memory” of the checkpoints that came before it?
For example, in folder_before_interrupted/ I have checkpoint-8300 with an eval_bleu of 21.1, while checkpoint-13000 has an eval_bleu of 19.9. When I continue from checkpoint-13000 with an early stopping patience of 3 and load_best_model_at_end, will the Trainer compare against checkpoint-8300 when deciding which checkpoint is best (assuming checkpoint-8300 is within the patience range)?
When I continue training, I save the new checkpoints in a separate folder, folder_after_interrupted/. Would this separation prevent the Trainer from seeing checkpoint-8300, or does checkpoint-13000 already have “memory” of everything that came before it?
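To see for myself what checkpoint-13000 “remembers”, I peeked at the trainer_state.json that the Trainer writes into every checkpoint folder (a minimal sketch; the paths are just my local layout):

import json
from pathlib import Path

# The checkpoint I resume from; this path is specific to my setup.
ckpt = Path("folder_before_interrupted/checkpoint-13000")

# Trainer writes its bookkeeping (log history, best metric seen so far, ...)
# to trainer_state.json inside each checkpoint folder.
state = json.loads((ckpt / "trainer_state.json").read_text())

# These two fields are what load_best_model_at_end relies on.
print(state.get("best_metric"))            # best eval_bleu recorded so far
print(state.get("best_model_checkpoint"))  # path of the checkpoint that achieved it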
If it’s helpful, these are my trainer arguments for continuing from a checkpoint:
from transformers import (
    EarlyStoppingCallback,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    evaluation_strategy="steps",
    eval_steps=config.eval_steps,
    save_steps=config.eval_steps,
    learning_rate=config.lr,
    per_device_train_batch_size=config.batch_size,
    per_device_eval_batch_size=config.batch_size,
    weight_decay=config.weight_decay,
    # save_total_limit=2,
    num_train_epochs=config.epochs,
    predict_with_generate=True,
    load_best_model_at_end=True,
    greater_is_better=True,
    metric_for_best_model="bleu",
    gradient_accumulation_steps=config.ga_value,
    # note: this field is not read by Trainer itself; I also pass the
    # checkpoint path to trainer.train() below
    resume_from_checkpoint=model_checkpoint,
    do_train=do_train,
    # optim="adafactor",
    fp16=False,
)
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=dataset_dict_tokenized["train"],
    eval_dataset=dataset_dict_tokenized["val"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
train_output = trainer.train(resume_from_checkpoint=model_checkpoint)
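Right after trainer.train() returns, I also print what the resumed run considers the best checkpoint, to check whether it still points into folder_before_interrupted/ (just a sanity check; trainer.state is the state that gets restored from the checkpoint on resume):

# final training metrics from the resumed run
print(train_output.metrics)

# these should reveal whether checkpoint-8300 is still counted as the best
print(trainer.state.best_metric)
print(trainer.state.best_model_checkpoint)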
Thanks!