Hi all,
I’m trying to resume my training from a checkpoint.
My training arguments:
training_args = TrainingArguments(
    output_dir=repo_name,
    group_by_length=True,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=8,
    evaluation_strategy="steps",
    num_train_epochs=50,
    fp16=True,
    save_steps=500,
    eval_steps=400,
    logging_steps=10,
    learning_rate=5e-4,
    warmup_steps=3000,
    push_to_hub=True,
)
My trainer:
trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=common_voice_train,
    eval_dataset=common_voice_test,
    tokenizer=processor.feature_extractor,
)
Everything works fine up to this point.
Then my training command:
trainer.train(resume_from_checkpoint=True)
The error is:
----> 1 trainer.train(resume_from_checkpoint=True)
/opt/conda/lib/python3.7/site-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1073             resume_from_checkpoint = get_last_checkpoint(args.output_dir)
   1074             if resume_from_checkpoint is None:
-> 1075                 raise ValueError(f"No valid checkpoint found in output directory ({args.output_dir})")
   1076
   1077         if resume_from_checkpoint is not None:
ValueError: No valid checkpoint found in output directory (stt-arabic-2)
Any thoughts on why this happens?
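For reference, the traceback shows that get_last_checkpoint(args.output_dir) returned None, so the first thing to verify is whether the output directory actually contains a checkpoint. Below is a minimal stdlib-only sketch of that check. It assumes checkpoints are saved as checkpoint-<step> subfolders of output_dir, which is my understanding of the Trainer convention; find_last_checkpoint is a hypothetical helper written for illustration, not the library's own function.

```python
import os
import re

# Assumption: checkpoints live in "checkpoint-<step>" subfolders of output_dir.
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_last_checkpoint(output_dir):
    """Hypothetical helper: return the highest-step checkpoint folder, or None."""
    if not os.path.isdir(output_dir):
        return None
    steps = []
    for name in os.listdir(output_dir):
        match = _CHECKPOINT_RE.match(name)
        # Only count real directories whose name ends in a step number.
        if match and os.path.isdir(os.path.join(output_dir, name)):
            steps.append(int(match.group(1)))
    if not steps:
        return None
    return os.path.join(output_dir, f"checkpoint-{max(steps)}")
```

If find_last_checkpoint("stt-arabic-2") returns None, there is nothing to resume from in that folder yet; otherwise the returned path could be passed explicitly, as in trainer.train(resume_from_checkpoint=path).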