Resuming Training from Checkpoints Stored on Hugging Face Hub (without downloading manually)

Thanks for your response! I appreciate the links.

I might be misunderstanding something, but from what I can see, the examples you shared assume that the checkpoint is stored locally. My question is whether Trainer can resume training directly from a checkpoint stored on the Hugging Face Hub, without manually downloading the files first (e.g., using hf_hub_download).

In my example above, I’d like to resume training from checkpoint-5, which is stored in my Hub repo at my-user/my-private-model/checkpoint-5.

From my testing, it seems like Trainer’s resume_from_checkpoint only works when the checkpoint is already on the local filesystem, which is why I had to manually fetch it from the Hub before resuming. If there’s a way to do this more seamlessly, I’d love to know!
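For concreteness, the manual fetch step could look roughly like the sketch below. It uses snapshot_download with allow_patterns to pull only the checkpoint folder, rather than calling hf_hub_download once per file. The helper names checkpoint_pattern and fetch_checkpoint and the ./resume directory are made up for illustration, and the sketch assumes the checkpoint folder sits at the root of the repo and that a token with access to the private repo is already configured.

```python
import os


def checkpoint_pattern(checkpoint_name: str) -> str:
    """Glob pattern matching every file under one checkpoint folder in the repo."""
    return f"{checkpoint_name.strip('/')}/*"


def fetch_checkpoint(repo_id: str, checkpoint_name: str, local_dir: str) -> str:
    """Download a single checkpoint folder from the Hub and return its local path.

    Hypothetical helper: assumes the checkpoint folder is at the repo root.
    """
    # Imported lazily so checkpoint_pattern works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    root = snapshot_download(
        repo_id=repo_id,
        allow_patterns=[checkpoint_pattern(checkpoint_name)],  # skip everything else
        local_dir=local_dir,
    )
    return os.path.join(root, checkpoint_name.strip("/"))


# Usage (needs network access and, for a private repo, a logged-in token):
# ckpt_dir = fetch_checkpoint("my-user/my-private-model", "checkpoint-5", "./resume")
# trainer.train(resume_from_checkpoint=ckpt_dir)
```

This still downloads the files to disk; it only wraps the fetch and the resume into one step, so it is a workaround rather than true resume-from-Hub support.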

Let me know if I’m missing something - maybe I’m overcomplicating it. Thanks again! :blush:
