Trainer, device error cuda:0 and cuda:1

Hi, I want to train a model using the Trainer. In the model_init function, I instantiate a model and perform heavy calculations for experimental weight initialization. These calculations need to run on a CUDA device, so I move the model's parameters onto cuda manually (plain cuda, not cuda:0 or cuda:1). Roughly, it looks like the sketch below.
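
A minimal sketch of the pattern (not my exact code; some_expensive_init, the model name, and train_dataset are placeholders):

import torch
from transformers import AutoModel, Trainer, TrainingArguments

def some_expensive_init(t):
    # stand-in for the heavy experimental weight initialization
    return t * torch.randn_like(t)

def model_init():
    model = AutoModel.from_pretrained("bert-base-uncased")
    device = torch.device("cuda")  # plain cuda, not cuda:0 or cuda:1
    with torch.no_grad():
        for param in model.parameters():
            # run the heavy math on the GPU and keep the result as the parameter data
            param.data = some_expensive_init(param.data.to(device))
    return model

trainer = Trainer(
    args=TrainingArguments(output_dir="out"),
    model_init=model_init,
    train_dataset=train_dataset,
)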

My system has two CUDA cards, both of which I want to use. When I run the training (trainer.train()), I get the error: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0.

I expected the Trainer to take care of data parallelism. When I instantiate the model without moving the parameters to cuda (so the initialization takes an eternity on the CPU), everything works fine.

Could somebody please share any insights into how I can use cuda for model initialization without running into this error during training? Many thanks!

I faced this error once when using YOLOv.

Do this at the top of your script, before torch initializes CUDA:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

Thank you for your suggestion! That would indeed work, but it would turn off my second GPU entirely, which I do not want :wink:

But I found a funny solution that I do not entirely understand; I'll share it for those having the same issue.

Somewhere in my model I have code along the lines of:

some_weight = nn.Parameter(any_random_weight).to(device)

Changing this to

some_weight = nn.Parameter(any_random_weight.to(device))

i.e. moving the tensor to cuda before making it a parameter (as opposed to the other way round) did the trick! I do not know why, but I'll gratefully accept that it works now. device was set to "cuda" in my case.
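
My best guess at why (I have not verified this against the Trainer internals): calling .to(device) on an nn.Parameter returns a plain Tensor rather than a Parameter, so the result never gets registered with the module and stays behind on its original device when DataParallel replicates the model across the two GPUs. Moving the tensor first and wrapping it afterwards keeps it a real Parameter. A small sketch of the difference:

import torch
import torch.nn as nn

device = torch.device("cuda")
any_random_weight = torch.randn(4, 4)

# Broken variant: .to() applied to the Parameter returns a plain
# (non-leaf) Tensor, so a module would not register it as a parameter.
broken = nn.Parameter(any_random_weight).to(device)
print(isinstance(broken, nn.Parameter))  # False

# Working variant: move the tensor first, then wrap it, so the module
# holds a real Parameter that gets moved/replicated onto each GPU.
working = nn.Parameter(any_random_weight.to(device))
print(isinstance(working, nn.Parameter))  # True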
