Is it possible to push_to_hub at every checkpoint?

When training a model with something like:


from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    "my-model-train",
    push_to_hub=True,
    evaluation_strategy="steps",
    # per_device_train_batch_size=batch_size,
    # per_device_eval_batch_size=batch_size,
    auto_find_batch_size=True,
    predict_with_generate=True,
    logging_steps=5000,
    save_steps=5000,
    eval_steps=20_000,
    warmup_steps=500,
    max_steps=200_000,
    # overwrite_output_dir=True,
    save_total_limit=5,
    metric_for_best_model="bleu",
)

trainer = Seq2SeqTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=ds_train.with_format("torch"),
    eval_dataset=ds_valid.with_format("torch"),
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
trainer.push_to_hub("my-model-train")

Sometimes the VM kernel dies before reaching max_steps during trainer.train(), so the checkpoints never get pushed to the Hub. When that happens, resuming with resume_from_checkpoint does not work:

model = AutoModelForSeq2SeqLM.from_pretrained("my-model-train")
trainer.train(resume_from_checkpoint=True)
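
As far as I understand, resume_from_checkpoint=True only looks for local checkpoint-* folders in the output directory, and from_pretrained only gives back the last pushed weights without optimizer/scheduler state. A minimal sketch of guarding the resume, reusing the training_args and trainer names from the snippets above:

import os

from transformers.trainer_utils import get_last_checkpoint

# resume_from_checkpoint=True searches args.output_dir for checkpoint-* folders,
# so it can only resume if the local checkpoints survived the kernel death
last_checkpoint = None
if os.path.isdir(training_args.output_dir):
    last_checkpoint = get_last_checkpoint(training_args.output_dir)

if last_checkpoint is not None:
    trainer.train(resume_from_checkpoint=last_checkpoint)
else:
    trainer.train()  # nothing to resume from, start fresh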

Is it possible to set the training arguments to push_to_hub at every save_steps or eval_steps, and not just when the model finishes at max_steps?

Currently my workaround is doing something like:


for i in range(20):
    max_each = 20_000 * (i + 1)

    training_args = Seq2SeqTrainingArguments(
        "my-model-train",
        push_to_hub=True,
        evaluation_strategy="steps",
        # per_device_train_batch_size=batch_size,
        # per_device_eval_batch_size=batch_size,
        auto_find_batch_size=True,
        predict_with_generate=True,
        logging_steps=5000,
        save_steps=5000,
        eval_steps=20_000,
        warmup_steps=500,
        max_steps=max_each,
        # overwrite_output_dir=True,
        save_total_limit=5,
        metric_for_best_model="bleu",
    )

    trainer = Seq2SeqTrainer(
        model=model,
        tokenizer=tokenizer,
        args=training_args,
        train_dataset=ds_train.with_format("torch"),
        eval_dataset=ds_valid.with_format("torch"),
        data_collator=data_collator,
        compute_metrics=compute_metrics,
    )

    try:
        # resume if a local checkpoint exists
        trainer.train(resume_from_checkpoint=True)
    except Exception:
        # no local checkpoint to resume from yet, start from scratch
        trainer.train()
    trainer.push_to_hub("my-model-train")

There’s the save_strategy argument, but it seems to only control local saving and not push the checkpoints to the Hub in a way that lets them be loaded with trainer.train(resume_from_checkpoint=True). Is that right? If so, is there an option like a push_strategy to make the push to the Hub happen every save_steps?

save_strategy (str or IntervalStrategy, optional, defaults to "steps") –

The checkpoint save strategy to adopt during training. Possible values are:
 - "no": No save is done during training.
 - "epoch": Save is done at the end of each epoch.
 - "steps": Save is done every save_steps.

I face the same issue; my workaround has been downloading the checkpoint files and pushing them from local storage to the Hub by hand. If you’ve found something else, let me know.
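
If you want to automate that manual upload, one option is a small TrainerCallback that pushes each checkpoint folder right after it is written. A rough sketch, assuming you are logged in to the Hub and using a hypothetical repo id your-username/my-model-train:

import os

from huggingface_hub import upload_folder
from transformers import TrainerCallback


class PushCheckpointCallback(TrainerCallback):
    """Upload every checkpoint-<step> folder to the Hub as soon as it is saved."""

    def __init__(self, repo_id):
        self.repo_id = repo_id

    def on_save(self, args, state, control, **kwargs):
        # on_save fires after the Trainer has written checkpoint-<global_step>
        checkpoint_dir = os.path.join(args.output_dir, f"checkpoint-{state.global_step}")
        if os.path.isdir(checkpoint_dir):
            upload_folder(
                repo_id=self.repo_id,
                folder_path=checkpoint_dir,
                path_in_repo=f"checkpoint-{state.global_step}",
                commit_message=f"checkpoint at step {state.global_step}",
            )


# hypothetical repo id; register the callback on the existing trainer
trainer.add_callback(PushCheckpointCallback("your-username/my-model-train"))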

Look up the hub_strategy="checkpoint" setting.
It pushes only the latest checkpoint (and pushes again at the end of training), so you can resume from it.

There is also the "all_checkpoints" option to push all checkpoints.
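
For reference, a minimal sketch of how that could slot into the arguments from the question; the hub_strategy value and the last-checkpoint subfolder used for resuming are taken from the Trainer documentation, the rest reuses the question’s own setup:

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    "my-model-train",
    push_to_hub=True,
    hub_strategy="checkpoint",  # keep pushing the latest checkpoint to a last-checkpoint subfolder
    evaluation_strategy="steps",
    auto_find_batch_size=True,
    predict_with_generate=True,
    logging_steps=5000,
    save_steps=5000,
    eval_steps=20_000,
    warmup_steps=500,
    max_steps=200_000,
    save_total_limit=5,
    metric_for_best_model="bleu",
)

# ... build the Seq2SeqTrainer exactly as before ...

# After a crash, the docs describe resuming from the pushed checkpoint with:
trainer.train(resume_from_checkpoint="last-checkpoint")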