Thanks for your response! I appreciate the links.
I might be misunderstanding something, but from what I see, the examples you shared assume that the checkpoint is stored locally. My question is more about whether Trainer can resume training directly from a checkpoint stored on the Hugging Face Hub, without manually downloading the files first (e.g., using hf_hub_download).
In my example above, I’d like to resume training from checkpoint-5, located on my Hub at my-user/my-private-model/checkpoint-5.
From my testing, it seems like Trainer’s resume_from_checkpoint only works when the checkpoint is already on the local filesystem, which is why I had to manually fetch it from the Hub before resuming. If there’s a way to do this more seamlessly, I’d love to know!
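
A minimal sketch of that manual step, assuming the checkpoint lives in a checkpoint-5 subfolder of the repo. It uses snapshot_download with allow_patterns to pull the whole folder at once instead of calling hf_hub_download per file. The helper names (checkpoint_patterns, fetch_checkpoint) are hypothetical, and this is not necessarily the exact code behind the workaround described above.

```python
import os


def checkpoint_patterns(checkpoint: str) -> list[str]:
    """Glob patterns that limit a snapshot download to one checkpoint folder."""
    return [f"{checkpoint}/*"]


def fetch_checkpoint(repo_id: str, checkpoint: str, token=None) -> str:
    """Download only `checkpoint` from `repo_id` and return its local path.

    Hypothetical helper. token=None is assumed to fall back to the locally
    saved Hub login, which a private repo would need.
    """
    # Imported here so the module loads even where huggingface_hub is absent.
    from huggingface_hub import snapshot_download

    local_root = snapshot_download(
        repo_id=repo_id,
        allow_patterns=checkpoint_patterns(checkpoint),
        token=token,
    )
    # The snapshot keeps the repo layout, so the checkpoint is a subfolder.
    return os.path.join(local_root, checkpoint)


# Usage (needs network access and an already constructed Trainer):
# ckpt_dir = fetch_checkpoint("my-user/my-private-model", "checkpoint-5")
# trainer.train(resume_from_checkpoint=ckpt_dir)
```

The download lands in the local Hub cache, so resume_from_checkpoint still receives an ordinary local path; this only automates the fetch and does not make Trainer read from the Hub directly.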
Let me know if I’m missing something - maybe I’m overcomplicating it. Thanks again!