Trainer.train(resume_from_checkpoint=True)

Hi all,

I'm trying to resume my training from a checkpoint.
My training arguments:
training_args = TrainingArguments(
    output_dir=repo_name,
    group_by_length=True,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=8,
    evaluation_strategy="steps",
    num_train_epochs=50,
    fp16=True,
    save_steps=500,
    eval_steps=400,
    logging_steps=10,
    learning_rate=5e-4,
    warmup_steps=3000,
    push_to_hub=True,
)

My trainer:
trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=common_voice_train,
    eval_dataset=common_voice_test,
    tokenizer=processor.feature_extractor,
)

Up to here everything is fine.

Then my training command:
trainer.train(resume_from_checkpoint=True)

The error is:
----> 1 trainer.train(resume_from_checkpoint=True)

/opt/conda/lib/python3.7/site-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
1073 resume_from_checkpoint = get_last_checkpoint(args.output_dir)
1074 if resume_from_checkpoint is None:
---> 1075 raise ValueError(f"No valid checkpoint found in output directory ({args.output_dir})")
1076
1077 if resume_from_checkpoint is not None:

ValueError: No valid checkpoint found in output directory (stt-arabic-2)

Any thoughts why?


You probably need to check whether checkpoints are actually being saved in the output directory. You can also pass the checkpoint directory explicitly with resume_from_checkpoint='checkpoint_dir'.
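
For example, a minimal sketch, assuming a checkpoint folder such as checkpoint-500 was actually written under the output directory (the step number here is only an illustration):

# resume_from_checkpoint=True only works if output_dir already contains a checkpoint-* folder;
# passing the path explicitly avoids the "No valid checkpoint found" error.
trainer.train(resume_from_checkpoint="stt-arabic-2/checkpoint-500")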


Can I push the checkpoints to the Hugging Face Hub?

Welcome @maher13 :hugs: Have you tried the above answer (which seems to me to be the right one)?
There are multiple ways of pushing your model to the Hub; see more about it here.
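
For instance, a minimal sketch of two common options (the repo name below is a placeholder):

# 1) With push_to_hub=True in TrainingArguments (as in the snippet above),
#    the Trainer also uploads saved checkpoints to the Hub by default.
# 2) Or push explicitly once training is done:
trainer.push_to_hub()
# or push just the model to a repo of your choice:
model.push_to_hub("my-username/stt-arabic-2")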

Yes, I did. Thank you both for your help.

What if I don't want to push my model to the Hub?
How can I resume the training from a specific checkpoint?

Hey Eran,

I had a similar question a while ago. All I did was specify the checkpoint path in the model and tokenizer declarations,

e.g.

tokenizer = AutoTokenizer.from_pretrained("my_awesome_model/checkpoint-xxxx")

model = AutoModelForSequenceClassification.from_pretrained("my_awesome_model/checkpoint-xxxx")

and then leave trainer.train() without any arguments since you are loading the pretrained checkpoint already.
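
Putting it together, a minimal sketch (the checkpoint path, datasets, and training_args are placeholders; only the from_pretrained calls differ from a fresh run):

from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer

checkpoint_dir = "my_awesome_model/checkpoint-xxxx"  # placeholder path

tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint_dir)

trainer = Trainer(
    model=model,
    args=training_args,           # your usual TrainingArguments
    train_dataset=train_dataset,  # your datasets, same as before
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
)
# No resume_from_checkpoint argument needed: the checkpoint weights are already loaded.
# Note that this restores the model weights only, not the optimizer/scheduler state in the checkpoint.
trainer.train()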


I came across this issue and I have a solution. For context, I was using these training_args to fine-tune a Whisper model.

You can still use trainer.train() with no arguments and have it resume from a checkpoint. You just include this in your training_args:

training_args = Seq2SeqTrainingArguments(
    output_dir="/your/path",  # change to a repo name of your choice
    resume_from_checkpoint="/path/checkpoint-4000",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=12000,
    gradient_checkpointing=True,
    fp16=True,  # change to False if using CPU only
    evaluation_strategy="steps",
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=1000,
    eval_steps=1000,
    logging_steps=25,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=True,
)

So resume_from_checkpoint does not want True/False in this case; it wants a str as the input argument. Your IDE should show you, when you hover your cursor over the parameter, what type it requires.

Go to your local machine where you save your checkpoints and literally copy the directory path to the checkpoint. For example, here is a Linux path to a checkpoint: resume_from_checkpoint="/home/name/whispertuning/whisper-small-random/checkpoint-4000".
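
If you would rather not copy the path by hand, here is a minimal sketch using get_last_checkpoint, the same helper the Trainer calls internally (visible in the traceback above); "/your/path" is a placeholder for your output_dir:

from transformers.trainer_utils import get_last_checkpoint

# Returns the path of the most recent checkpoint-* folder in the directory, or None if there is none.
last_checkpoint = get_last_checkpoint("/your/path")
print(last_checkpoint)  # e.g. /your/path/checkpoint-4000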


I did not change any model or tokenizers (although you can do it that way and instead just use the checkpoint as your model, but if you want to continuously train a model this will not be efficient).

Just keep track of your metric. If you train to 4000 steps and then run another training run, have a look at whether your metrics are actually improving from the previous checkpoint you were working from, such as the word error rate for speech-to-text. I hope this answers your question practically and conceptually.


Just want to understand it clearly: so when I resume from this checkpoint, it will not repeat the data it was already trained on? (Given the train and test sets are the same size as before.)

Correct. It will start training the model from the specified checkpoint and will not go back over the previous checkpoints.