Setting a specific device for the Trainer

Ideally we shouldn’t do this, but you can modify the last else statement from
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
to
device = torch.device("cuda:1" if torch.cuda.is_available() else "cpu")
in the _setup_devices function of the training_args.py file in the transformers library.
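
For reference, a minimal sketch of that edit, assuming the line in _setup_devices looks like the one quoted above (the surrounding code varies between transformers versions):

import torch

# Original line inside _setup_devices (training_args.py):
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# Changed line, pinning the run to the second GPU:
device = torch.device("cuda:1" if torch.cuda.is_available() else "cpu")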

I did that too, and I transferred all the data to the device. Still, when I call trainer.train(), this error comes up:
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:2
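
(For context, that error comes from torch.nn.DataParallel, which the Trainer wraps the model in when more than one GPU is visible: DataParallel expects the module’s parameters and buffers on device_ids[0], which is cuda:0 by default. A minimal sketch of how it is triggered, assuming a machine with at least three visible GPUs, not the exact code from this thread:)

import torch
import torch.nn as nn

model = nn.Linear(10, 10).to("cuda:2")   # parameters live on cuda:2
parallel = nn.DataParallel(model)        # device_ids defaults to all visible GPUs, starting at cuda:0
x = torch.randn(4, 10).to("cuda:0")
out = parallel(x)                        # raises: module must have its parameters and buffers on device cuda:0 ...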

Yes, you’re correct @yaanhaan, but if you pass this argument into the Trainer, it’ll use it for TrainingArguments, which will set the given device for your run.

It worked for me BTW :smiley:

This works for me:

import os
os.environ["CUDA_VISIBLE_DEVICES"]="1"
import torch

You have to set the env variable before importing torch. After that, torch only sees physical GPU 1, not GPU 0.

If you run this:

print(torch.cuda.current_device())
print(torch.cuda.is_available())

Torch will still report device 0, because CUDA renumbers the visible devices starting from 0, but nvidia-smi shows it is actually using physical GPU 1.
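
Putting it together with the Trainer, a minimal sketch (the model and dataset names below are placeholders, not from this thread):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"   # must be set before importing torch/transformers

import torch
from transformers import Trainer, TrainingArguments

print(torch.cuda.device_count())    # 1  -> only physical GPU 1 is visible
print(torch.cuda.current_device())  # 0  -> physical GPU 1, renumbered as cuda:0

# training_args = TrainingArguments(output_dir="output")
# trainer = Trainer(model=my_model, args=training_args, train_dataset=my_train_dataset)
# trainer.train()   # runs on physical GPU 1, which torch and the Trainer see as cuda:0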


Thank you, @josejames00, you saved my day with this solution

You are my hero