Hi, all!
I want to resume training from a checkpoint, so I call trainer.train(resume_from_checkpoint=True) (I also tried trainer.train(resume_from_checkpoint=checkpoint_dir)).
The training code is:
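For context on what resume_from_checkpoint=True does: the Trainer looks for the newest checkpoint-&lt;step&gt; subdirectory under output_dir and resumes from it. A rough stdlib sketch of that lookup (an illustration of the behavior, not the library's actual code) is:

```python
import os
import re


def find_last_checkpoint(folder: str):
    """Return the path of the highest-numbered checkpoint-<step> subdirectory, or None."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    candidates = [
        (int(m.group(1)), name)
        for name in os.listdir(folder)
        if (m := pattern.match(name)) and os.path.isdir(os.path.join(folder, name))
    ]
    if not candidates:
        return None
    # Pick the checkpoint with the largest step number.
    return os.path.join(folder, max(candidates)[1])
```

So with my settings, resuming from output_dir=base_path should pick up something like base_path/checkpoint-2500 automatically.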
training_args = TrainingArguments(
    output_dir=base_path, overwrite_output_dir=True, num_train_epochs=num_train_epochs,
    learning_rate=5e-5, weight_decay=0.01, warmup_steps=10000, local_rank=args.local_rank,
    per_device_train_batch_size=train_batch_size, per_device_eval_batch_size=eval_batch_size,
    save_total_limit=5, save_strategy="steps", evaluation_strategy="steps",
    logging_steps=2500, save_steps=2500, eval_steps=2500,
    load_best_model_at_end=True, metric_for_best_model="eval_loss",
    logging_dir="temp", seed=1994, data_seed=1994,
)
custom_dataset = mydataset(args_file_path)
eval_ratio = 0.005
train_size = int(len(custom_dataset) * (1 - eval_ratio))
eval_size = len(custom_dataset) - train_size
print(f"the evaluation size is {eval_size}")
train_dataset, eval_dataset = random_split(
    custom_dataset, [train_size, eval_size],
    generator=torch.Generator().manual_seed(1994),
)
trainer = TripletLossTrainer(
model=model, args=training_args,
data_collator=collator,
train_dataset=train_dataset,
eval_dataset=eval_dataset
)
transformers.utils.logging.set_verbosity_info()
trainer.train(resume_from_checkpoint=True)
trainer.save_model(base_path)
It does load the latest checkpoint, but the training progress bar shows training starting from step 1, not from the step where it previously stopped.
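In case it helps with debugging: on resume, the step counter is restored from the trainer_state.json file inside the checkpoint folder, so a quick sanity check is to read the recorded global_step directly. A minimal sketch, assuming the standard Trainer checkpoint layout (the helper name is mine):

```python
import json
import os


def last_global_step(checkpoint_dir: str) -> int:
    """Read the global_step recorded in a Trainer checkpoint's trainer_state.json."""
    state_path = os.path.join(checkpoint_dir, "trainer_state.json")
    with open(state_path) as f:
        state = json.load(f)
    return state["global_step"]
```

If this returns the step the run stopped at but the progress bar still starts at 1, the checkpoint itself is fine and the problem is in how the Trainer consumes it.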