Is it possible to push to HuggingFace on every checkpoint while training?

Hi all!

I’m trying to use one of the example scripts under the transformers library. In particular, this one:

At the moment, I’m using the MNIST dataset, just to keep things simple. I’ve been reading the documentation, but I can’t figure out whether this is possible. As per my title, I would like to push to HuggingFace on every checkpoint, including any optimizer/scheduler internal state alongside the checkpoint, so I can later pull from HuggingFace and pick up where I left off.

Is this something that is doable right now? Or should I write my own custom script to get this functionality?

Thanks in advance,

  • Farley

Just add push_to_hub=True to your TrainingArguments.
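As a sketch of what that looks like (assuming a recent transformers release; the `output_dir` and `hub_model_id` values below are placeholders, not from this thread):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="my-model",              # local checkpoint directory (placeholder)
    push_to_hub=True,                   # push to the Hub during training
    hub_model_id="username/my-model",   # target repo on the Hub (placeholder)
    hub_strategy="checkpoint",          # also push the latest checkpoint folder,
                                        # including optimizer/scheduler state
    save_strategy="steps",
    save_steps=500,
)
```

If I remember correctly, `hub_strategy="checkpoint"` pushes the most recent checkpoint (with optimizer and scheduler state) so you can resume with `trainer.train(resume_from_checkpoint=...)`, and `"all_checkpoints"` pushes every checkpoint folder; check the TrainingArguments docs for your version to confirm.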


@sgugger Thank you for the reply! Actually, what I’ve found is that push_to_hub=True launches a background git push. I think the problem is that sometimes the git push process hangs. Is there any way to debug that git process, to see why it does not finish pushing?

Also, just for context: I can sometimes go into the directory and run git push manually with success. But I’ve also found that only the first checkpoint produces a git commit. While trying to debug the background git push, I noticed that 3 or 4 checkpoints can pile up with no commits for them, so only the first checkpoint ever gets pushed.

Let me know if there’s any extra info I can provide. I might try copying one of the run_*.py scripts and writing some custom Python code to solve this problem.
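If you do end up rolling your own, one approach (purely a sketch of my own, not part of the Trainer API) is to run git synchronously after each save with an explicit timeout, so a stuck push raises an error instead of hanging silently in the background. You could call a helper like this from a `TrainerCallback`'s `on_save` hook; the function name and its `remote`/`timeout` parameters are made up for illustration:

```python
import subprocess

def commit_and_push(repo_dir, message, remote=None, timeout=300):
    """Stage and commit everything in repo_dir; optionally push with a timeout.

    Returns True if a new commit was created, False if there was nothing
    to commit. This is an illustrative helper, not a transformers API.
    """
    subprocess.run(["git", "add", "-A"], cwd=repo_dir, check=True)
    # `git commit` exits non-zero when there is nothing to commit,
    # so capture the result instead of using check=True here.
    result = subprocess.run(
        ["git", "commit", "-m", message],
        cwd=repo_dir, capture_output=True, text=True,
    )
    if result.returncode != 0:
        return False
    if remote is not None:
        # A hung push now raises subprocess.TimeoutExpired instead of
        # blocking forever, which makes the failure visible and debuggable.
        subprocess.run(["git", "push", remote, "HEAD"],
                       cwd=repo_dir, check=True, timeout=timeout)
    return True
```

Running it synchronously loses the non-blocking behaviour of the built-in background push, but at least every checkpoint either gets committed or fails loudly.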