Trainer.train(resume_from_checkpoint=True)

Hi all,

I’m trying to resume my training from a checkpoint.
My training arguments:
training_args = TrainingArguments(
    output_dir=repo_name,
    group_by_length=True,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=8,
    evaluation_strategy="steps",
    num_train_epochs=50,
    fp16=True,
    save_steps=500,
    eval_steps=400,
    logging_steps=10,
    learning_rate=5e-4,
    warmup_steps=3000,
    push_to_hub=True,
)

My trainer:
trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=common_voice_train,
    eval_dataset=common_voice_test,
    tokenizer=processor.feature_extractor,
)

Up to here everything works.

Then I start training with:
trainer.train(resume_from_checkpoint=True)

The error is:
----> 1 trainer.train(resume_from_checkpoint=True)

/opt/conda/lib/python3.7/site-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
1073 resume_from_checkpoint = get_last_checkpoint(args.output_dir)
1074 if resume_from_checkpoint is None:
-> 1075 raise ValueError(f"No valid checkpoint found in output directory ({args.output_dir})")
1076
1077 if resume_from_checkpoint is not None:

ValueError: No valid checkpoint found in output directory (stt-arabic-2)

Any thoughts on why?

You probably need to check whether checkpoints are actually being saved in the output directory. You can also pass the checkpoint directory explicitly, e.g. resume_from_checkpoint='checkpoint_dir'.
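To make that check concrete, here is a minimal sketch. It assumes the Trainer's usual layout, where each checkpoint is saved as a `checkpoint-<step>` folder inside `output_dir`. The helper name `find_last_checkpoint` is made up for illustration and uses only the standard library; transformers ships its own `get_last_checkpoint` in `transformers.trainer_utils`, which is the function raising the error in the traceback above. Note that with `save_steps=500`, no such folder exists until optimizer step 500 has been reached.

```python
import os
import re
from typing import Optional

# Assumption: checkpoints live in "<output_dir>/checkpoint-<global_step>".
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_last_checkpoint(output_dir: str) -> Optional[str]:
    """Hypothetical helper: return the highest-step checkpoint folder, or None."""
    if not os.path.isdir(output_dir):
        return None
    found = []
    for name in os.listdir(output_dir):
        match = _CHECKPOINT_RE.match(name)
        # Only count real directories whose name matches the pattern.
        if match and os.path.isdir(os.path.join(output_dir, name)):
            found.append((int(match.group(1)), name))
    if not found:
        return None
    # Compare by step number, not by string, so 1000 beats 500.
    return os.path.join(output_dir, max(found)[1])


# Usage with the trainer from the question (not executed here):
# checkpoint = find_last_checkpoint(training_args.output_dir)
# trainer.train(resume_from_checkpoint=checkpoint)  # None starts from scratch
```

If this returns None for your output directory, there is nothing to resume from yet, and `resume_from_checkpoint=True` will fail with the ValueError shown above.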

Can I push the checkpoints to the Hugging Face Hub?

Welcome @maher13! Have you tried the answer above? It seems to me to be the right one.
There are multiple ways of pushing your model to the Hub; see more about it here.

Yes, I did. Thank you both for your help.