I’m training plugin modules (e.g. adapters) on top of a language model on multiple GPUs using Hugging Face Accelerate. However, something strange happens when I load additional parameters into the model: training cannot proceed, and the process state is not what I expect.
My running command is like this:
CUDA_VISIBLE_DEVICES=1,2,3,4 python -m torch.distributed.launch --nproc_per_node 4 --use_env ./mycode.py
But the process state looks like this:
I don’t know why there are three processes on GPU1. This is definitely not correct.
The code works well if I use a single GPU, or if I don’t load the additional parameters, so the problem is clearly in the loading code.
FYI, the code that loads the additional parameters is as follows:
model = MyRobertaModel.from_pretrained(
    args.model_name_or_path,
    model_args=model_args,
    data_args=data_args,
    training_args=training_args,
    n_tokens=args.n_tokens,
)
accelerator.wait_for_everyone()  # I tried adding a barrier, but it doesn't solve the problem
t = args.t
if t > 0:
    embed_pool = torch.load(os.path.join(args.saved_plugin_dir, 'embed_pool.pth'))
    for i in range(t):
        model.add_prompt_embedding(t=i, saved_embedding=embed_pool[i])
    plugin_ckpt = torch.load(os.path.join(args.saved_plugin_dir, 'plugin_ckpt.pth'))
    model.load_plugin(plugin_ckpt)
accelerator.wait_for_everyone()
model.resize_token_embeddings(len(tokenizer))
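In case a minimal reproduction helps, here is a self-contained sketch of the save/load pattern I use for the plugin checkpoint (the module and its sizes are placeholders, not my real adapter; my actual code calls torch.load without a map_location argument):

```python
import os
import tempfile

import torch
import torch.nn as nn

# Minimal stand-in for one adapter block (hypothetical size).
adapter = nn.Linear(8, 8)

# Save a checkpoint dict the same way I save plugin_ckpt.pth.
ckpt_path = os.path.join(tempfile.mkdtemp(), "plugin_ckpt.pth")
torch.save({"adapter_0": adapter.state_dict()}, ckpt_path)

# Load it back. Note: torch.load restores tensors onto the device
# they were saved from unless map_location is given explicitly;
# here I map to CPU just to keep the sketch device-agnostic.
plugin_ckpt = torch.load(ckpt_path, map_location="cpu")
adapter.load_state_dict(plugin_ckpt["adapter_0"])
print(sorted(plugin_ckpt.keys()))  # ['adapter_0']
```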
The code for loading the plugin looks like this:
def load_plugin(self, plugin_ckpt):
    idx = 0
    for name, sub_module in super().named_modules():
        if isinstance(sub_module, MyAdapter):
            sub_module.adapter.load_state_dict(plugin_ckpt[f'adapter_{idx}'])
            idx += 1
    print('Loaded plugins successfully!')
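For what it’s worth, this indexing relies on named_modules() yielding the adapters in a stable registration order. A toy check of that assumption (ToyAdapter is a stand-in for my real class):

```python
import torch.nn as nn


class ToyAdapter(nn.Module):
    # Stand-in for MyAdapter: wraps a small bottleneck layer.
    def __init__(self):
        super().__init__()
        self.adapter = nn.Linear(4, 4)


model = nn.Sequential(ToyAdapter(), nn.ReLU(), ToyAdapter())

# named_modules() walks submodules in registration order, so a
# running index lines up with the saved keys adapter_0, adapter_1, ...
found = [name for name, m in model.named_modules() if isinstance(m, ToyAdapter)]
print(found)  # ['0', '2']
```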
Also, my environment is:
Python 3.6.8
transformers 4.11.3
accelerate 0.5.1
NVIDIA GPU cluster
Thank you very much for your help!