I’m fine-tuning a language model on multiple GPUs, but I ran into a problem when saving the model. After saving it with .save_pretrained(output_dir), I tried to load the saved model with .from_pretrained(output_dir) and got the following error message:
OSError: Unable to load weights from pytorch checkpoint file for 'xxx' at 'my_model_dir' If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.
This error is strange because when I check the model directory, both config.json and pytorch_model.bin are there. Also, I’m not using TensorFlow at all, so the hint in the error message doesn’t help.
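Since the file exists but fails to load, one quick check is whether pytorch_model.bin itself is intact. Below is a minimal sketch using only the standard library; the helper name inspect_checkpoint is made up for illustration. It assumes the checkpoint was written by PyTorch 1.6 or later, where torch.save produces a zip archive by default, so a legacy-format file would report False here and could still be valid.

```python
import os
import zipfile


def inspect_checkpoint(model_dir, weights_name="pytorch_model.bin"):
    """Return basic facts about the weights file inside model_dir.

    Hypothetical helper for illustration; not part of transformers.
    """
    path = os.path.join(model_dir, weights_name)
    exists = os.path.isfile(path)
    return {
        "path": path,
        "exists": exists,
        "size_bytes": os.path.getsize(path) if exists else 0,
        # Assumption: zip-based torch.save format (default since PyTorch 1.6).
        # A truncated or interleaved write usually fails this check.
        "is_zip_archive": exists and zipfile.is_zipfile(path),
    }
```

If is_zip_archive comes back False, or size_bytes is much smaller than expected for the model, the file was probably damaged while it was being written rather than at load time.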
I’m currently using the accelerate library for multi-GPU training. The relevant code for saving the model is as follows:
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(args.output_dir)
    # torch.save(unwrapped_model.state_dict(), args.output_dir+'.pth')
    if accelerator.is_main_process:
        tokenizer.save_pretrained(args.output_dir)
I’ve seen similar reports online but haven’t found a working solution. I suspect the problem is specific to the multi-GPU setup, because the same code works fine on a single GPU.
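One difference in the multi-GPU case is that every process calls save_pretrained on the same output directory at the same time, which could leave pytorch_model.bin partially overwritten. A sketch of a variant that might be worth trying, not a confirmed fix: it assumes a transformers version whose save_pretrained accepts the is_main_process and save_function arguments, and relies on accelerator.save to write from a single process. The helper name save_from_main_process is hypothetical.

```python
def save_from_main_process(accelerator, model, tokenizer, output_dir):
    """Save model and tokenizer so that only one process writes to disk.

    Hypothetical helper; `accelerator` is expected to be an
    accelerate.Accelerator and `model` a transformers PreTrainedModel
    that was passed through accelerator.prepare().
    """
    # Make sure all processes have finished training before saving.
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    # Assumption: save_pretrained supports these two keyword arguments.
    # accelerator.save is meant to avoid several ranks writing the same file.
    unwrapped_model.save_pretrained(
        output_dir,
        is_main_process=accelerator.is_main_process,
        save_function=accelerator.save,
    )
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)
    # Keep other ranks from reading the directory before the write is done.
    accelerator.wait_for_everyone()
```

In the training script this would replace the snippet above, called as save_from_main_process(accelerator, model, tokenizer, args.output_dir).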
FYI, here is my environment information:
NVIDIA GPU cluster
I’m not sure whether I’m missing something important for the multi-GPU setting. Thanks a lot for your help!