Trainer.train(resume_from_checkpoint=True)

Hi all,

I'm trying to resume my training from a checkpoint.
My training arguments:
training_args = TrainingArguments(
    output_dir=repo_name,
    group_by_length=True,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=8,
    evaluation_strategy="steps",
    num_train_epochs=50,
    fp16=True,
    save_steps=500,
    eval_steps=400,
    logging_steps=10,
    learning_rate=5e-4,
    warmup_steps=3000,
    push_to_hub=True,
)

My trainer:
trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=common_voice_train,
    eval_dataset=common_voice_test,
    tokenizer=processor.feature_extractor,
)

Up to here everything is fine.

Then my training command:
trainer.train(resume_from_checkpoint=True)

the error is:
----> 1 trainer.train(resume_from_checkpoint=True)

/opt/conda/lib/python3.7/site-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1073             resume_from_checkpoint = get_last_checkpoint(args.output_dir)
   1074             if resume_from_checkpoint is None:
-> 1075                 raise ValueError(f"No valid checkpoint found in output directory ({args.output_dir})")
   1076
   1077         if resume_from_checkpoint is not None:

ValueError: No valid checkpoint found in output directory (stt-arabic-2)

Any thoughts why?

You probably need to check whether the checkpoints are actually being saved in the output directory. You can also provide the checkpoint directory explicitly with resume_from_checkpoint='checkpoint_dir'.
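For example, passing an explicit path (a minimal sketch; "stt-arabic-2/checkpoint-500" is a hypothetical folder name, so check which checkpoint-* folders actually exist in your output_dir):

# Resume from a specific checkpoint folder instead of auto-detection.
# The path below is hypothetical; use a real checkpoint-* directory from output_dir.
trainer.train(resume_from_checkpoint="stt-arabic-2/checkpoint-500")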

Can I push the checkpoints to the Hugging Face Hub?

Welcome @maher13 :hugs: Have you tried the above answer (which seems to me to be the right one)?
There are multiple ways of pushing your model to the Hub; see more about it here.
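For example, since push_to_hub=True is already set in your TrainingArguments, the Trainer can upload checkpoints during training (controlled by the hub_strategy argument); you can also push manually at any point, as in this minimal sketch using the standard Trainer API:

# Push the current model, tokenizer, and config to the Hub repo
# configured through TrainingArguments.
trainer.push_to_hub(commit_message="Training in progress")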

Yes, I did. Thank you both for your help.

What if I don't want to push my model to the Hub?
How can I resume training from a specific checkpoint?

Hey Eran,

I had a similar question a while ago; all I did was specify the checkpoint path in the model and tokenizer declarations,

e.g.

tokenizer = AutoTokenizer.from_pretrained("my_awesome_model/checkpoint-xxxx")
model = AutoModelForSequenceClassification.from_pretrained("my_awesome_model/checkpoint-xxxx")

and then leave trainer.train() without any arguments, since you are already loading the pretrained checkpoint.

I came across this issue and I have a solution. For context, I was using these training_args for a Whisper model fine-tune.

You can still use trainer.train() with no arguments and have it resume from a checkpoint. You just include this in your training_args:

training_args = Seq2SeqTrainingArguments(
    output_dir="/your/path",  # change to a repo name of your choice
    resume_from_checkpoint="/path/checkpoint-4000",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=12000,
    gradient_checkpointing=True,
    fp16=True,  # change to False if using CPU only
    evaluation_strategy="steps",
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=1000,
    eval_steps=1000,
    logging_steps=25,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=True,
)

So resume_from_checkpoint does not want True/False in this case; it wants a str as its input argument. Your IDE should show you what type the parameter requires when you hover your cursor over it.

Go to the machine where you save your checkpoints and copy the full path to the checkpoint directory. For example, here is a Linux path to a checkpoint: resume_from_checkpoint = "/home/name/whispertuning/whisper-small-random/checkpoint-4000".
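If you would rather not hard-code the path, here is a minimal sketch (assuming the standard checkpoint-* folder layout that Trainer writes) using the get_last_checkpoint helper from transformers:

from transformers.trainer_utils import get_last_checkpoint

# Find the most recent checkpoint-* folder inside the output directory;
# the path reuses the example above and is a placeholder.
last_ckpt = get_last_checkpoint("/home/name/whispertuning/whisper-small-random")
if last_ckpt is None:
    trainer.train()  # no checkpoint yet, start training from scratch
else:
    trainer.train(resume_from_checkpoint=last_ckpt)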

I did not change any model or tokenizer (although you can do it that way and just use the checkpoint as your model, but if you want to continuously train a model this will not be efficient).

Just keep track of your metric. If you train to 4000 steps and then run another training run, have a look to see whether your metrics, such as the word error rate (WER) for speech-to-text, are actually improving from the previous checkpoint you were working from. I hope this answers your question practically and conceptually.
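For reference, a minimal sketch of a compute_metrics that reports WER, assuming the evaluate library and a Whisper-style processor object (both are my assumptions, not something from this thread):

import evaluate

wer_metric = evaluate.load("wer")  # word error rate, lower is better

def compute_metrics(pred):
    # Replace the -100 label padding so the tokenizer can decode the references.
    label_ids = pred.label_ids
    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id  # processor is assumed
    pred_str = processor.batch_decode(pred.predictions, skip_special_tokens=True)
    label_str = processor.batch_decode(label_ids, skip_special_tokens=True)
    # Matches metric_for_best_model="wer" and greater_is_better=False above.
    return {"wer": wer_metric.compute(predictions=pred_str, references=label_str)}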
