Trainer, device error cuda:0 and cuda:1

Hi, I want to train a model using the Trainer. In my model_init function, I instantiate a model and perform heavy calculations for an experimental weight initialization. These calculations need to run on a CUDA device, which is why I instantiate the model on the GPU by manually moving its parameters onto the generic cuda device (not cuda:0 or cuda:1 explicitly).
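Roughly, my setup looks like this; the model here is only a stand-in for my actual custom architecture, and the heavy initialization is just sketched:

import torch
from transformers import AutoModelForSequenceClassification

def model_init():
    # Stand-in model; my real one is a custom architecture.
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
    # Manually move every parameter to the generic "cuda" device.
    for p in model.parameters():
        p.data = p.data.to("cuda")
    # ... heavy experimental weight-initialization calculations run here on the GPU ...
    return model

I then pass model_init to the Trainer as usual.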

My system has two CUDA cards, both of which I want to use. When I run the training (trainer.train()), I get the error: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0.

I expected the Trainer to take care of data parallelism. When I instantiate the model without moving parameters to cuda (which makes the initialization take an eternity), everything works fine.

Could somebody please share any insights into how I can use cuda for model initialization without running into this error during training? Many thanks!

I faced this error once when using yolov.

Try adding this at the top of your script:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

Thank you for your suggestion! That would indeed work, but it would hide my second GPU entirely, which I do not want :wink:

I did find a curious solution, though, which I do not entirely understand, but I'll share it for anyone hitting the same issue.

Somewhere in my model I have code along the lines of:

some_weight = nn.Parameter(any_random_weight).to(device)

Changing this to

some_weight = nn.Parameter(any_random_weight.to(device))

i.e. moving the tensor to cuda before making it a parameter (as opposed to the other way round) did the trick! I do not know why, but I'll gratefully accept that it works now. device was set to "cuda" in my case.
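If anyone wants a guess at the why (this is my speculation, not something I've confirmed in the Trainer source): calling .to() on an nn.Parameter returns a plain Tensor when it copies across devices, so the attribute never ends up among the module's registered parameters. model.to(...) and DataParallel replication then skip it, leaving it stranded on cuda:0 while everything else moves. A tiny demo (ToyModule is a made-up name; this assumes two visible GPUs):

import torch
import torch.nn as nn

class ToyModule(nn.Module):
    def __init__(self):
        super().__init__()
        # Old pattern: .to() copies the Parameter into a plain Tensor,
        # so this attribute is NOT registered as a parameter.
        self.broken = nn.Parameter(torch.randn(3)).to("cuda")
        # Fixed pattern: move first, wrap second -> a real registered Parameter.
        self.fixed = nn.Parameter(torch.randn(3).to("cuda"))

m = ToyModule()
print(isinstance(m.broken, nn.Parameter))       # False
print(isinstance(m.fixed, nn.Parameter))        # True
print([n for n, _ in m.named_parameters()])     # ['fixed'] only

m.to("cuda:1")                                  # moving the module, as model.to(...) would
print(m.broken.device)                          # cuda:0 -- left behind
print(m.fixed.device)                           # cuda:1 -- moved with the module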

