I want to train a model on multiple GPUs on a single machine.
The complication is that a second model is involved while training the main model.
This second model is used ONLY for inference, on the same data as the main model.
I have 4 GPUs on my machine.
So in my accelerate configuration I assigned 3 processes (3 GPUs) to training the main model.
I plan to dedicate the last (4th) GPU to the second model.
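Here is a minimal sketch of what I mean (the models are placeholders, not my real code; in the actual run the main model is prepared by accelerate and launched with `accelerate launch --num_processes 3`):

```python
# Rough sketch of the intended setup (placeholder models; the real main model
# is wrapped by accelerate and trained on GPUs 0-2).
import torch
import torch.nn as nn

main_model = nn.Linear(8, 8)   # stands in for the main model being trained
aux_model = nn.Linear(8, 8)    # stands in for the inference-only second model

# Pin the second model to the spare 4th GPU; fall back to CPU so the sketch runs anywhere.
aux_device = torch.device("cuda:3" if torch.cuda.device_count() > 3 else "cpu")
aux_model = aux_model.to(aux_device).eval()

batch = torch.randn(4, 8)      # same data the main model trains on

# Inference-only pass through the second model, no gradients needed.
with torch.no_grad():
    aux_out = aux_model(batch.to(aux_device))

print(aux_out.shape)
```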
However, this setup results in an NCCL communication error when the second model runs inference.
I haven't figured out the cause yet.
Could you guide me?