Hello everyone!
I hope you are all doing well.
I am currently training some LoRAs (via PEFT) using the Transformers Trainer. The Hugging Face Hub was down this evening for less than an hour, and that caused my training to crash, even though I am not pushing my model to the Hub and I am saving my checkpoints locally.
Please bear with me while I try to explain the chain of events:
- Whenever the Trainer saves a checkpoint, it calls `Trainer._save()`.
- Within this function there is a `model.save_pretrained()` call.
- Because I'm using a PEFT model, this call resolves to the method on `PeftModel`.
- There, we can find a call to `get_peft_model_state_dict()` from `peft.utils`.
- This call in turn includes a call to `file_exists()` from `huggingface_hub`.
- Since the Hub was down while the checkpoint was being saved (locally), I got the following error:

```
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='huggingface.co', port=443): Read timed out.
```
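To make the failure mode concrete, here is a self-contained sketch of what I believe is happening (all names here are stand-ins I wrote for illustration, not the actual Trainer/PEFT code): the checkpoint write itself is purely local, but an incidental remote existence check runs first and can kill the whole save.

```python
class ReadTimeout(Exception):
    """Stand-in for requests.exceptions.ReadTimeout."""

def file_exists(repo_id, filename):
    # Stand-in for huggingface_hub.file_exists(): this is the only step
    # that touches the network. Simulate the Hub being down.
    raise ReadTimeout(
        "HTTPSConnectionPool(host='huggingface.co', port=443): Read timed out."
    )

def save_pretrained(output_dir):
    # Stand-in for PeftModel.save_pretrained() -> get_peft_model_state_dict():
    # the remote check runs before anything is written to disk, so a purely
    # local save fails if the Hub is unreachable.
    file_exists("some-org/some-base-model", "config.json")
    return f"wrote adapter weights to {output_dir}"

try:
    save_pretrained("./checkpoint-500")
except ReadTimeout as err:
    print(f"local save crashed: {err}")
```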
My question is: is there a way to disable these calls to the Hugging Face Hub during the Trainer's saving phase? Perhaps a way to make the `file_exists()` check run against a cached copy of the model? In theory my training loop should be able to run completely offline once the model has been downloaded, but this dependency on the Hub makes the process require a stable connection for what seems to me like a very simple file check.
I did not post this as an Issue because I think it's probably a mistake in my configuration of the Trainer rather than a shortcoming of the libraries.
I'll tag @muellerzr since he is listed as a contact for Trainer issues on the Transformers GitHub.
Thank you in advance for any help.