Trainer, device error cuda:0 and cuda:1

Hi, I want to train a model using the Trainer. In the model_init function, I instantiate a model and perform heavy calculations for experimental weight initialization. These calculations need to run on a CUDA device, so I move the model's parameters onto cuda manually (plain cuda, not cuda:0 or cuda:1). Roughly, it looks like the sketch below.
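
A minimal sketch of the pattern (not my exact code; some_expensive_init, the model name, and train_dataset are placeholders):

import torch
from transformers import AutoModel, Trainer, TrainingArguments

def some_expensive_init(t):
    # stand-in for the heavy experimental weight initialization
    return t * torch.randn_like(t)

def model_init():
    model = AutoModel.from_pretrained("bert-base-uncased")
    device = torch.device("cuda")  # plain cuda, not cuda:0 or cuda:1
    with torch.no_grad():
        for param in model.parameters():
            # run the heavy math on the GPU and keep the result as the parameter data
            param.data = some_expensive_init(param.data.to(device))
    return model

trainer = Trainer(
    args=TrainingArguments(output_dir="out"),
    model_init=model_init,
    train_dataset=train_dataset,
)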

My system has two CUDA cards, both of which I want to use. When I run the training (trainer.train()), I get the error: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0.

I expected the Trainer to take care of data parallelism. When I instantiate the model without moving the parameters to cuda (so the initialization takes an eternity on the CPU), everything works fine.

Could somebody please share any insights into how I can use cuda for model initialization without running into this error during training? Many thanks!

I faced this error once when using YOLOv.

Do this at the top of your script, before torch initializes CUDA:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

Thank you for your suggestion! That would indeed work, but it would turn off my second GPU entirely, which I do not want :wink:

But I found a funny solution that I do not entirely understand; I'll share it for those having the same issue.

Somewhere in my model I have code along the lines of:

some_weight = nn.Parameter(any_random_weight).to(device)

Changing this to

some_weight = nn.Parameter(any_random_weight.to(device))

i.e. moving the tensor to cuda before making it a parameter (as opposed to the other way round) did the trick! I do not know why, but I'll gratefully accept that it works now. device was set to "cuda" in my case.
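
My best guess at why (I have not verified this against the Trainer internals): calling .to(device) on an nn.Parameter returns a plain Tensor rather than a Parameter, so the result never gets registered with the module and stays behind on its original device when DataParallel replicates the model across the two GPUs. Moving the tensor first and wrapping it afterwards keeps it a real Parameter. A small sketch of the difference:

import torch
import torch.nn as nn

device = torch.device("cuda")
any_random_weight = torch.randn(4, 4)

# Broken variant: .to() applied to the Parameter returns a plain
# (non-leaf) Tensor, so a module would not register it as a parameter.
broken = nn.Parameter(any_random_weight).to(device)
print(isinstance(broken, nn.Parameter))  # False

# Working variant: move the tensor first, then wrap it, so the module
# holds a real Parameter that gets moved/replicated onto each GPU.
working = nn.Parameter(any_random_weight.to(device))
print(isinstance(working, nn.Parameter))  # True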
