Hi @sgugger,
My training got interrupted partway through its epochs, so I used resume_from_checkpoint to continue it. I was wondering: when I resume from a checkpoint and use load_best_model_at_end, does the last checkpoint have “memory” of the checkpoints that came before it?
For example, in folder_before_interrupted/ I have checkpoint-8300 with an eval_bleu of 21.1, while checkpoint-13000 has an eval_bleu of 19.9. When I continue from checkpoint-13000 with an early stopping patience of 3 and load_best_model_at_end, will the Trainer compare against checkpoint-8300 when deciding which checkpoint is best (assuming checkpoint-8300 is within the patience range)?
When I continue training, I save the new checkpoints in a separate folder, folder_after_interrupted/. Would this separation prevent the Trainer from seeing checkpoint-8300, or does checkpoint-13000 already have “memory” of everything that came before it?
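To see for myself what checkpoint-13000 “remembers”, I peeked at the trainer_state.json that the Trainer writes into every checkpoint folder (a minimal sketch; the paths are just my local layout):

import json
from pathlib import Path

# The checkpoint I resume from; this path is specific to my setup.
ckpt = Path("folder_before_interrupted/checkpoint-13000")

# Trainer writes its bookkeeping (log history, best metric seen so far, ...)
# to trainer_state.json inside each checkpoint folder.
state = json.loads((ckpt / "trainer_state.json").read_text())

# These two fields are what load_best_model_at_end relies on.
print(state.get("best_metric"))            # best eval_bleu recorded so far
print(state.get("best_model_checkpoint"))  # path of the checkpoint that achieved it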
If it’s helpful, these are my trainer arguments for continuing from a checkpoint:
from transformers import (
    EarlyStoppingCallback,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    evaluation_strategy="steps",
    eval_steps=config.eval_steps,
    save_steps=config.eval_steps,
    learning_rate=config.lr,
    per_device_train_batch_size=config.batch_size,
    per_device_eval_batch_size=config.batch_size,
    weight_decay=config.weight_decay,
    # save_total_limit=2,
    num_train_epochs=config.epochs,
    predict_with_generate=True,
    load_best_model_at_end=True,
    greater_is_better=True,
    metric_for_best_model="bleu",
    gradient_accumulation_steps=config.ga_value,
    # note: this field is not read by Trainer itself; I also pass the
    # checkpoint path to trainer.train() below
    resume_from_checkpoint=model_checkpoint,
    do_train=do_train,
    # optim="adafactor",
    fp16=False,
)
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=dataset_dict_tokenized["train"],
    eval_dataset=dataset_dict_tokenized["val"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
train_output = trainer.train(resume_from_checkpoint=model_checkpoint)
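Right after trainer.train() returns, I also print what the resumed run considers the best checkpoint, to check whether it still points into folder_before_interrupted/ (just a sanity check; trainer.state is the state that gets restored from the checkpoint on resume):

# final training metrics from the resumed run
print(train_output.metrics)

# these should reveal whether checkpoint-8300 is still counted as the best
print(trainer.state.best_metric)
print(trainer.state.best_model_checkpoint)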
Thanks!