Error occurs when saving model in multi-GPU settings

I’m finetuning a language model on multiple GPUs. However, I ran into a problem when saving the model. After saving it with .save_pretrained(output_dir), I tried to load the saved model with .from_pretrained(output_dir), but got the following error message.

OSError: Unable to load weights from pytorch checkpoint file for 'xxx' at 'my_model_dir'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.

This error is strange because when I check the output directory, both config.json and pytorch_model.bin are there. Also, I’m obviously not using TF at all, so the suggestion in the error message isn’t helpful.

Currently, I’m using the accelerate library to train in a multi-GPU setting. The relevant code for saving the model is as follows:

# make sure all processes have finished training before saving
accelerator.wait_for_everyone()
# strip the wrapper added by accelerator.prepare()
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained(args.output_dir)
# torch.save(unwrapped_model.state_dict(), args.output_dir + '.pth')
if accelerator.is_main_process:
    tokenizer.save_pretrained(args.output_dir)
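
For reference, the loading step that triggers the error is just a standard from_pretrained call, roughly like this (AutoModelForCausalLM here is a stand-in for the model class I’m actually using):

from transformers import AutoModelForCausalLM  # stand-in for the actual model class

model = AutoModelForCausalLM.from_pretrained(args.output_dir)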

I’ve seen similar problems reported online but haven’t found a working solution. I think the problem lies in the multi-GPU setup, because everything works fine on a single GPU.

FYI, here is my environment information:
python 3.6.8
transformers 3.4.0
accelerate 0.5.1
NVIDIA GPU cluster

Not sure if I’m missing anything important for the multi-GPU setting. Thanks a lot for your help!

Is your training a multi-node training? What may have happened is that you saved the model on the main process only, so only on one machine. The other machines then can’t find the model when you try to load it.

You can use the is_local_main_process attribute of the accelerator to save once per machine.
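
For a multi-node run, the save could look roughly like this (a minimal sketch, reusing the accelerator, model, tokenizer and args.output_dir from your script above):

accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
# is_local_main_process is True for exactly one process on each machine,
# so every node writes its own copy of the checkpoint to local disk
if accelerator.is_local_main_process:
    unwrapped_model.save_pretrained(args.output_dir)
    tokenizer.save_pretrained(args.output_dir)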

No, my case is one machine with multiple GPUs. In that case, is the if accelerator.is_main_process: guard necessary, or should I just delete it? (Sorry, I’m not very familiar with the mechanics of the accelerate package.)

In that case, make sure there is an accelerator.wait_for_everyone() before the from_pretrained call, as one process may be trying to access the weights before the main process has finished saving them.
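
Something along these lines, for instance (a minimal sketch for the single-machine case; AutoModelForCausalLM is just a placeholder for whatever model class you are actually using):

from transformers import AutoModelForCausalLM  # placeholder for your actual model class

accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
if accelerator.is_main_process:
    unwrapped_model.save_pretrained(args.output_dir)
    tokenizer.save_pretrained(args.output_dir)

# second barrier: no process should call from_pretrained before the save is done
accelerator.wait_for_everyone()
model = AutoModelForCausalLM.from_pretrained(args.output_dir)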

I’ll give it a try. Thank you for your help!