Hi, I want to train a model using the `Trainer`. In the `model_init` function, I instantiate a model and perform heavy calculations for experimental weight initialization. These calculations need to run on a CUDA device, so I instantiate my model on the GPU by manually moving its parameters to the generic `cuda` device (not `cuda:0` or `cuda:1`).
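To make the setup concrete, here is a stripped-down sketch of what my `model_init` roughly looks like (the model class and the initialization computation are placeholders; the real ones are much heavier):

```python
import torch
import torch.nn as nn

# Placeholder model; my real model and init routine are much heavier.
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(16, 2)

def model_init():
    # Generic "cuda" device, no explicit index; fall back to CPU if unavailable.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = MyModel().to(device)
    with torch.no_grad():
        # Stand-in for the heavy experimental weight initialization on the GPU.
        for p in model.parameters():
            p.copy_(torch.randn_like(p) * 0.02)
    return model
```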
My system has two CUDA GPUs, both of which I want to use. When I run training with `trainer.train()`, I get the error: `RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0`.
I expected the `Trainer` to take care of data parallelism. When I instantiate the model without moving its parameters to CUDA, everything works fine, but the initialization then takes an eternity.
Could somebody please share any insights into how I can use CUDA for model initialization without running into this error during training? Many thanks!