Error occurs when saving model in multi-gpu settings

EchoShao8899 · November 5, 2021, 5:19pm

I’m fine-tuning a language model on multiple GPUs, but I ran into a problem when saving the model. After saving it with .save_pretrained(output_dir), I tried to load it with .from_pretrained(output_dir) and got the following error message:

OSError: Unable to load weights from pytorch checkpoint file for 'xxx' at 'my_model_dir' If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.

This error is strange because when I check the model directory, both config.json and pytorch_model.bin are there. Also, I’m not using TF at all, so the instruction in the error message doesn’t help.

Currently, I’m using the accelerate library to train in a multi-GPU setting. The relevant code for saving the model is as follows:

accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained(args.output_dir)
# torch.save(unwrapped_model.state_dict(), args.output_dir+'.pth')
if accelerator.is_main_process:
    tokenizer.save_pretrained(args.output_dir)

I saw similar problems reported online but haven’t found a useful solution. I think the problem lies in the multi-GPU setting, because everything works fine on a single GPU.

FYI, here is my environment information:
python 3.6.8
transformers 3.4.0
accelerate 0.5.1
NVIDIA GPU cluster

Not sure if I’m missing anything important in the multi-GPU setting. Thanks a lot for your help!

sgugger · November 8, 2021, 1:06pm

Is your training a multinode training? What may have happened is that you saved the model on the main process only, so only on one machine. The other machines then don’t find your model when you try to load it.

You can use the is_local_main_process attribute of the accelerator to save once per machine.
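For the multi-node case described above, a minimal sketch of saving once per machine might look like the following. The function name save_once_per_machine is hypothetical, and the attribute is spelled is_local_main_process in accelerate. The objects are passed in as arguments, so the sketch assumes only that they expose the methods already used in this thread.

```python
def save_once_per_machine(accelerator, model, tokenizer, output_dir):
    """Hypothetical helper: write the checkpoint once on every machine."""
    # Wait until all processes have finished training.
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    # is_local_main_process is True for exactly one process per machine,
    # so each node ends up with its own copy of the checkpoint.
    if accelerator.is_local_main_process:
        unwrapped_model.save_pretrained(output_dir)
        tokenizer.save_pretrained(output_dir)
    # Wait again so no process moves on before the files are written.
    accelerator.wait_for_everyone()
```

On a single machine, is_local_main_process and is_main_process refer to the same process, so this behaves like a plain main-process save.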

EchoShao8899 · November 9, 2021, 1:23pm

No, my case is one machine with multiple GPUs. In that case, is the if accelerator.is_main_process: check necessary, or should I just delete it? (Sorry, I’m not very familiar with how the accelerate package works.)

sgugger · November 9, 2021, 2:02pm

In that case make sure there is an accelerator.wait_for_everyone() before the from_pretrained call, as one process may be trying to access the weights before the main process has saved them.
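A rough sketch of that ordering for the single-machine case is below. The function name save_then_reload and the model_cls parameter are hypothetical. Guarding the model save with is_main_process is an assumption on my part, not something confirmed in this thread; the idea is that only one process writes pytorch_model.bin while the others wait at the barrier.

```python
def save_then_reload(accelerator, model, tokenizer, output_dir, model_cls):
    """Hypothetical helper: save on the main process, then reload everywhere."""
    # Barrier 1: all processes finish training before anything is written.
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    if accelerator.is_main_process:
        # Assumption: a single writer avoids several processes writing
        # the same checkpoint file at the same time.
        unwrapped_model.save_pretrained(output_dir)
        tokenizer.save_pretrained(output_dir)
    # Barrier 2: nobody reads the checkpoint until the main process is done.
    accelerator.wait_for_everyone()
    return model_cls.from_pretrained(output_dir)
```

Here model_cls stands for whichever model class was used to create the model in the first place; the second barrier is the wait_for_everyone() call suggested above.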

EchoShao8899 · November 9, 2021, 2:29pm

I’ll have a try. Thank you for your help!
