How to disable Huggingface Hub during Trainer saving of PEFT models?

Hello everyone!

I hope you are all doing well.

I am currently training some LoRAs (via PEFT) using the Transformers Trainer. The Huggingface Hub was down this evening for less than an hour, but that caused my training to crash. This despite the fact that I am not pushing my model to the Hub and I am saving my checkpoints locally.

Please bear with me while I try to explain the chain of events:

  1. Whenever the trainer is saving a checkpoint it calls Trainer._save().
  2. Within this function there is a model.save_pretrained() call.
  3. Because I’m using a peft model, this call corresponds to the function in PeftModel.
  4. There, we can find a call to get_peft_model_state_dict() from peft.utils.
  5. This call in turn includes a call to file_exists() from huggingface_hub.
  6. Since the Hub was down while the checkpoint was being saved (locally), I got the following error:
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='huggingface.co', port=443): Read timed out.

My question is: Is there a way to disable these calls to the Hugging Face Hub during the Trainer saving phase? Perhaps there is a way to run the file_exists() check against a cached copy of the model. In theory, my training loop should be able to run completely offline once the model has been downloaded, but this Hub check makes the process depend on a stable connection for what seems to me like a very simple file check.
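To illustrate what a check "on a cached copy" could look like, here is a minimal sketch that only inspects the local Hub cache on disk. It assumes the standard cache layout (`models--<org>--<name>/snapshots/<revision>/<file>` under `~/.cache/huggingface/hub`). The function name `cached_file_exists` is hypothetical and is not something PEFT or Trainer calls. As far as I know, huggingface_hub also ships a `try_to_load_from_cache` helper that serves a similar purpose.

```python
from pathlib import Path
from typing import Optional


def cached_file_exists(repo_id: str, filename: str, cache_dir: Optional[Path] = None) -> bool:
    """Return True if `filename` is present in any locally cached snapshot of `repo_id`.

    Hypothetical helper: it never touches the network, it only looks at the disk.
    """
    # Assumed default cache location; HF_HOME or HF_HUB_CACHE may override it in practice.
    root = cache_dir or Path.home() / ".cache" / "huggingface" / "hub"
    # Cached model repos are stored as "models--<org>--<name>".
    repo_dir = root / ("models--" + repo_id.replace("/", "--"))
    snapshots = repo_dir / "snapshots"
    if not snapshots.is_dir():
        return False
    # Each snapshot directory corresponds to one downloaded revision.
    return any((snap / filename).exists() for snap in snapshots.iterdir())
```

For example, `cached_file_exists("some-org/some-model", "config.json")` (a placeholder repo id) would return True only if that file was downloaded earlier.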

I did not post this as an Issue because I think it’s probably just a misconfiguration of the Trainer on my part rather than a shortcoming of the libraries.

I’ll tag @muellerzr since he is listed as a contact for Trainer Issues on the Transformers github.

Thank you in advance for any help.

You can probably disable it during training; iirc it’s the environment variable TRANSFORMERS_OFFLINE=1. There should also be an option for turning off the hf_hub upload in the TrainingArguments.
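A minimal sketch of the offline setup, assuming the environment variables are set before any Hugging Face library is imported (they are read at import time). `HF_HUB_OFFLINE` is the huggingface_hub counterpart of `TRANSFORMERS_OFFLINE`, and `enable_hf_offline_mode` is just an illustrative helper name. I have not verified that this avoids the file_exists() error in the PEFT save path; that likely depends on the installed versions.

```python
import os


def enable_hf_offline_mode() -> None:
    """Ask the Hugging Face libraries to avoid network calls (illustrative helper)."""
    os.environ["HF_HUB_OFFLINE"] = "1"        # read by huggingface_hub
    os.environ["TRANSFORMERS_OFFLINE"] = "1"  # read by transformers


# Call this before importing transformers, peft, or huggingface_hub.
enable_hf_offline_mode()

# The usual imports and Trainer setup would follow, for example:
# from transformers import TrainingArguments
# args = TrainingArguments(output_dir="out", push_to_hub=False)
```

The same effect can be had from the shell by exporting both variables before launching the training script. `push_to_hub=False` is already the default in TrainingArguments, so it is shown only to make the intent explicit.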


Hello and thank you for the quick answer! I will give this a try.
