Using another model when training a model with accelerate on multi-GPUs

jin818 · September 13, 2021, 5:06pm

Hello,
I am trying to train a model on multiple GPUs on a single machine.
The complication is that a second model is used while training the main model.
This second model is used ONLY for inference, on the same data as the main model.
I have 4 GPUs on my machine,
so I assigned 3 processes (3 GPUs) to training the main model in the accelerate configuration.
I want to put the second model on the last (4th) GPU.
However, this results in an NCCL communication issue when running inference with the second model.
I have NOT figured out the reason yet.
Could you guide me?
Thanks

Elron · October 31, 2022, 11:07am

I was facing something similar (I think). My guess is that running inference on the second model somehow changes the main device used by the accelerator. In my case, the data loader suddenly started putting all the batches on the device dedicated to the second model.
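
For reference, the layout described in the question (3 training processes plus one spare GPU for an inference-only model) could be sketched roughly as below. This is only a sketch under assumptions, not a confirmed fix for the NCCL error: it assumes all four GPUs are visible to every training process, that training process i runs on `cuda:i`, and that the second model is deliberately NOT passed to `accelerator.prepare`, so it is not wrapped for distributed training. The helper names `pick_devices` and `aux_forward` are made up for illustration.

```python
def pick_devices(local_process_index, num_train_processes=3, num_gpus=4):
    """Return (train_device, aux_device) as device strings.

    Assumption: training process i uses cuda:i, and the highest-numbered
    GPU is left free for the inference-only model.
    """
    if num_train_processes >= num_gpus:
        raise ValueError("no spare GPU left for the second model")
    if not 0 <= local_process_index < num_train_processes:
        raise ValueError("local_process_index out of range")
    return f"cuda:{local_process_index}", f"cuda:{num_gpus - 1}"


def aux_forward(aux_model, batch, aux_device, train_device):
    """Run the inference-only model on its own GPU.

    The batch is copied explicitly to aux_device, and the output is copied
    back to train_device, so nothing relies on an implicit default device.
    """
    # Imported here so pick_devices can be used without torch installed.
    import torch

    aux_model.eval()
    with torch.no_grad():  # inference only, no autograd graph is built
        out = aux_model(batch.to(aux_device))
    return out.to(train_device)
```

In a training script this would be used along the lines of `train_device, aux_device = pick_devices(accelerator.local_process_index)`, followed by `aux_model.to(aux_device)` once at startup and `aux_forward(aux_model, batch, aux_device, accelerator.device)` inside the loop. Note that each of the 3 processes would then hold its own copy of the second model on the 4th GPU, so memory use on that GPU is tripled.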
