Resume training from checkpoint

Hi all!
I want to resume training from a checkpoint, so I call `trainer.train(resume_from_checkpoint=True)`
(I also tried `trainer.train(resume_from_checkpoint=checkpoint_dir)`).
The training code is:

    training_args = TrainingArguments(
        output_dir=base_path, overwrite_output_dir=True, num_train_epochs=num_train_epochs,
        learning_rate=5e-5, weight_decay=0.01, warmup_steps=10000, local_rank=args.local_rank,
        per_device_train_batch_size=train_batch_size, per_device_eval_batch_size=eval_batch_size,
        save_total_limit=5, save_strategy="steps", evaluation_strategy="steps",
        logging_steps=2500, save_steps=2500, eval_steps=2500,
        load_best_model_at_end=True, metric_for_best_model="eval_loss",
        logging_dir="temp", seed=1994, data_seed=1994)

    custom_dataset = mydataset(args_file_path)
    eval_ratio = 0.005
    train_size = int(len(custom_dataset) * (1 - eval_ratio))
    eval_size = len(custom_dataset) - train_size
    print(f"the evaluation size is {eval_size}")
    train_dataset, eval_dataset = random_split(
        custom_dataset, [train_size, eval_size],
        generator=torch.Generator().manual_seed(1994))
    trainer = TripletLossTrainer(
        model=model, args=training_args,
        data_collator=collator,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset
    )

    transformers.utils.logging.set_verbosity_info()
    trainer.train(resume_from_checkpoint=True)
    trainer.save_model(base_path)

It does load the latest checkpoint's model weights, but the training progress bar starts from step 1 rather than from the step where training previously stopped.
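As far as I understand, the resumed step counter comes from the `trainer_state.json` file saved inside the checkpoint folder, not from the model weights, so it may be worth checking that the file exists and carries a non-zero `global_step`. A minimal stdlib sketch of that check (the folder name and step values are made up for illustration):

```python
import json
import os
import tempfile

with tempfile.TemporaryDirectory() as base:
    # Simulate a saved checkpoint folder like the Trainer writes.
    ckpt = os.path.join(base, "checkpoint-2500")
    os.makedirs(ckpt)
    with open(os.path.join(ckpt, "trainer_state.json"), "w") as f:
        json.dump({"global_step": 2500, "epoch": 0.42}, f)

    # The resume logic reads this file to know where to continue.
    # If it is missing or global_step is 0, training restarts at step 1.
    with open(os.path.join(ckpt, "trainer_state.json")) as f:
        state = json.load(f)
    print(state["global_step"])  # 2500
```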


I’m seeing the same behavior. Transformers logs Loading model from models/.../checkpoint-13000, but when training begins the epoch/step count ends up reset to 0 anyway.
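If it helps with debugging: when `resume_from_checkpoint=True` is passed, the newest `checkpoint-<step>` folder under `output_dir` is picked by the highest step number. A stdlib-only sketch of that selection logic (the directory names here are invented for illustration, so you can confirm which folder would actually be resumed from):

```python
import os
import re
import tempfile

_CKPT_RE = re.compile(r"^checkpoint-(\d+)$")

def last_checkpoint(output_dir):
    """Return the checkpoint-<step> subfolder with the highest step, or None."""
    candidates = [
        d for d in os.listdir(output_dir)
        if _CKPT_RE.match(d) and os.path.isdir(os.path.join(output_dir, d))
    ]
    if not candidates:
        return None
    newest = max(candidates, key=lambda d: int(_CKPT_RE.match(d).group(1)))
    return os.path.join(output_dir, newest)

with tempfile.TemporaryDirectory() as base:
    for step in (2500, 5000, 13000):
        os.makedirs(os.path.join(base, f"checkpoint-{step}"))
    print(last_checkpoint(base))  # path ending in "checkpoint-13000"
```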