Error when loading additional parameters in multi-GPU training

I’m training plugins (e.g. adapters) on top of a language model on multiple GPUs using Hugging Face Accelerate. However, something strange happens when I try to load additional parameters into the model: training cannot proceed, and the process state is not what I expect.

My launch command looks like this:

 CUDA_VISIBLE_DEVICES=1,2,3,4 python -m torch.distributed.launch --nproc_per_node 4 --use_env ./mycode.py

But when I check the process state, I see something unexpected: there are three processes on GPU 1, and I don’t know why. This is definitely not correct.

The code works fine if I use a single GPU, or if I skip loading the additional parameters, so the problem is clearly in the loading code.

FYI, the code related to loading additional parameters is as follows:

    model = MyRobertaModel.from_pretrained(
        args.model_name_or_path,
        model_args=model_args,
        data_args=data_args,
        training_args=training_args,
        n_tokens=args.n_tokens,
    )

    accelerator.wait_for_everyone()  # I tried adding a barrier here, but it doesn't solve the problem

    t = args.t
    if t > 0:
        # Load the saved prompt embeddings and plugin (adapter) weights.
        embed_pool = torch.load(os.path.join(args.saved_plugin_dir, 'embed_pool.pth'))
        for i in range(t):
            model.add_prompt_embedding(t=i, saved_embedding=embed_pool[i])
        plugin_ckpt = torch.load(os.path.join(args.saved_plugin_dir, 'plugin_ckpt.pth'))
        model.load_plugin(plugin_ckpt)

    accelerator.wait_for_everyone()

    model.resize_token_embeddings(len(tokenizer))
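
To narrow this down, a small check like the one below can be run right after the `torch.load` calls. It is only a sketch: it assumes `embed_pool` and `plugin_ckpt` are the objects loaded above and that `accelerator` is the Accelerate `Accelerator`. It prints which device the loaded tensors end up on in each process:

    # Hypothetical debugging helper: report the device(s) torch.load placed the tensors on.
    import torch

    def tensor_devices(obj):
        """Recursively collect the devices of all tensors inside a checkpoint object."""
        if torch.is_tensor(obj):
            return {str(obj.device)}
        if isinstance(obj, dict):
            return set().union(*(tensor_devices(v) for v in obj.values()))
        if isinstance(obj, (list, tuple)):
            return set().union(*(tensor_devices(v) for v in obj))
        return set()

    print(f'[rank {accelerator.process_index}] embed_pool on: {tensor_devices(embed_pool)}')
    print(f'[rank {accelerator.process_index}] plugin_ckpt on: {tensor_devices(plugin_ckpt)}')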

The code for loading the plugin looks like this:

    def load_plugin(self, plugin_ckpt):
        # Walk the module tree and load each adapter's weights in traversal order.
        idx = 0
        for name, sub_module in super().named_modules():
            if isinstance(sub_module, MyAdapter):
                sub_module.adapter.load_state_dict(plugin_ckpt[f'adapter_{idx}'])
                idx += 1

        print('Loaded plugins successfully!')
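
For context, the `plugin_ckpt` dict this expects would come from a matching save loop on the same model class, roughly like the sketch below (`save_plugin` is not part of the original code, just an illustration of the assumed checkpoint format):

    def save_plugin(self):
        # Mirror load_plugin: collect each adapter's weights in the same traversal order,
        # keyed as 'adapter_0', 'adapter_1', ...
        plugin_ckpt = {}
        idx = 0
        for name, sub_module in super().named_modules():
            if isinstance(sub_module, MyAdapter):
                plugin_ckpt[f'adapter_{idx}'] = sub_module.adapter.state_dict()
                idx += 1
        return plugin_ckpt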

Also, my environment is:

python 3.6.8
transformers 4.11.3
accelerate 0.5.1
NVIDIA GPU cluster

Thank you very much for your help!

It turned out to be a silly mistake on my part.

    embed_pool = torch.load(os.path.join(args.saved_plugin_dir, 'embed_pool.pth'))

should be changed to

    embed_pool = torch.load(os.path.join(args.saved_plugin_dir, 'embed_pool.pth'), map_location=torch.device('cpu'))

The problem is that, by default, torch.load() maps the saved tensors back onto the device they were saved from (cuda:0 here). Under CUDA_VISIBLE_DEVICES=1,2,3,4, cuda:0 refers to the same physical GPU (GPU 1) in every process, so each of the other three ranks also creates a CUDA context and allocates memory there. That is why the extra processes show up on GPU 1.
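
For reference, here is a minimal sketch of the corrected loading step (using the same hypothetical `args` fields and `model` methods as above). Deserializing on CPU keeps every rank off GPU 1; alternatively, you could map straight onto each process's own device via `accelerator.device`:

    import os
    import torch

    if args.t > 0:
        # Deserialize on CPU so that no rank implicitly creates a CUDA context on cuda:0.
        embed_pool = torch.load(
            os.path.join(args.saved_plugin_dir, 'embed_pool.pth'),
            map_location='cpu',
        )
        plugin_ckpt = torch.load(
            os.path.join(args.saved_plugin_dir, 'plugin_ckpt.pth'),
            map_location='cpu',
        )
        for i in range(args.t):
            model.add_prompt_embedding(t=i, saved_embedding=embed_pool[i])
        model.load_plugin(plugin_ckpt)

        # Alternatively, map directly onto this process's device:
        #   torch.load(path, map_location=accelerator.device)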

Marking this problem as solved.