When training a model with something like:
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

training_args = Seq2SeqTrainingArguments(
    f"my-model-train",
    push_to_hub=True,
    evaluation_strategy="steps",
    # per_device_train_batch_size=batch_size,
    # per_device_eval_batch_size=batch_size,
    auto_find_batch_size=True,
    predict_with_generate=True,
    logging_steps=5000,
    save_steps=5000,
    eval_steps=20_000,
    warmup_steps=500,
    max_steps=200_000,
    # overwrite_output_dir=True,
    save_total_limit=5,
    metric_for_best_model="bleu",
)

trainer = Seq2SeqTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=ds_train.with_format("torch"),
    eval_dataset=ds_valid.with_format("torch"),
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
trainer.push_to_hub("my-model-train")
Sometimes the VM kernel dies during trainer.train() before max_steps is reached, and the checkpoints never get pushed to the Hub. When that happens, resume_from_checkpoint doesn't work:
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("my-model-train")
trainer.train(resume_from_checkpoint=True)
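As far as I can tell, resume_from_checkpoint=True only looks for checkpoint-* folders in the local output directory, so on a freshly restarted VM there is nothing to resume from unless the checkpoints were already pushed somewhere. A quick way to see this (sketch, using transformers.trainer_utils.get_last_checkpoint):

import os
from transformers.trainer_utils import get_last_checkpoint

# resume_from_checkpoint=True resolves against the local output_dir only;
# after a kernel death on a fresh VM there are no checkpoint-* folders left.
output_dir = "my-model-train"
last_ckpt = get_last_checkpoint(output_dir) if os.path.isdir(output_dir) else None
print(last_ckpt)  # None -> nothing to resume from, so the call above fails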
Is it possible to set the training arguments so that push_to_hub happens at every save_steps or eval_steps, and not only when the model reaches max_steps?
Currently my workaround is doing something like:
for i in range(20):
    max_each = 20_000 * (i + 1)
    training_args = Seq2SeqTrainingArguments(
        f"my-model-train",
        push_to_hub=True,
        evaluation_strategy="steps",
        # per_device_train_batch_size=batch_size,
        # per_device_eval_batch_size=batch_size,
        auto_find_batch_size=True,
        predict_with_generate=True,
        logging_steps=5000,
        save_steps=5000,
        eval_steps=20_000,
        warmup_steps=500,
        max_steps=max_each,
        # overwrite_output_dir=True,
        save_total_limit=5,
        metric_for_best_model="bleu",
    )
    trainer = Seq2SeqTrainer(
        model=model,
        tokenizer=tokenizer,
        args=training_args,
        train_dataset=ds_train.with_format("torch"),
        eval_dataset=ds_valid.with_format("torch"),
        data_collator=data_collator,
        compute_metrics=compute_metrics,
    )
    try:
        trainer.train(resume_from_checkpoint=True)
        trainer.push_to_hub("my-model-train")
    except:
        trainer.train()
        trainer.push_to_hub("my-model-train")
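For what it's worth, the bare except is there because, at least in my setup, trainer.train(resume_from_checkpoint=True) raises a ValueError when there is no checkpoint-* folder in the output directory yet. A slightly tighter version of the same fallback (sketch):

    # Same fallback, just catching the specific error instead of a bare except.
    # Assumption: with no local checkpoint-* folder, resume_from_checkpoint=True
    # raises ValueError ("No valid checkpoint found in output directory").
    try:
        trainer.train(resume_from_checkpoint=True)
    except ValueError:
        trainer.train()
    trainer.push_to_hub("my-model-train")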
There's the save_strategy argument, but it seems to only save checkpoints locally and not push them to the Hub in a form that can later be loaded with trainer.train(resume_from_checkpoint=True). Is that right? If so, is there an option like push_strategy or something meaning "push to Hub is done every save_steps"?
save_strategy (str or IntervalStrategy, optional, defaults to "steps") –
The checkpoint save strategy to adopt during training. Possible values are:
- "no": No save is done during training.
- "epoch": Save is done at the end of each epoch.
- "steps": Save is done every save_steps.