Ideally we shouldn’t do this, but you can modify the last else statement from
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
to
device = torch.device("cuda:1" if torch.cuda.is_available() else "cpu")
in the _setup_devices function in the training_args.py file of the transformers library.
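For reference, the branch being edited looks roughly like this (the exact surrounding code depends on your transformers version, so take this as an approximation rather than the actual source):

else:
    # changing "cuda:0" to "cuda:1" here makes the Trainer default to the second GPU
    device = torch.device("cuda:1" if torch.cuda.is_available() else "cpu")
    self._n_gpu = torch.cuda.device_count()

Keep in mind that editing installed library code is fragile (it is overwritten on every upgrade), which is why the CUDA_VISIBLE_DEVICES approach further down the thread is usually preferable.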
I did that too, and I transferred all the data to the device. Still, when I call trainer.train(), this error comes up:
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:2
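For what it’s worth, that error comes from torch.nn.DataParallel, which the Trainer wraps the model in when it sees more than one GPU: DataParallel requires the module’s parameters and buffers to sit on device_ids[0], which is cuda:0 by default. A minimal reproduction outside the Trainer, assuming a machine with at least three GPUs (the layer and sizes are arbitrary):

import torch
import torch.nn as nn

model = nn.Linear(10, 2).to("cuda:2")   # parameters live on cuda:2
dp = nn.DataParallel(model)             # device_ids defaults to all visible GPUs, so device_ids[0] is cuda:0
out = dp(torch.randn(4, 10))            # raises the RuntimeError above

So moving the model and data to another device by hand isn’t enough; hiding the other GPUs with CUDA_VISIBLE_DEVICES (see below) sidesteps the wrapper entirely.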
Yes, you’re correct @yaanhaan, but if you pass this argument into the Trainer, it’ll hand it to TrainingArguments, which sets the given device for your run.
It worked for me, BTW.
This works for me:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # only physical GPU 1 will be visible to torch
import torch
You have to set the environment variable before importing torch. Now torch only sees device 1, not device 0.
If you run this:
print(torch.cuda.current_device())
print(torch.cuda.is_available())
Torch will still report that it is using device 0, but nvidia-smi shows that it is actually running on physical device 1 (CUDA_VISIBLE_DEVICES renumbers the visible GPU as cuda:0).
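Putting the whole trick together, a minimal sketch (assuming the machine has at least two GPUs and you want physical GPU 1; the prints are just for verification):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"   # must be set before torch (or transformers) is imported

import torch

print(torch.cuda.is_available())     # True if physical GPU 1 is usable
print(torch.cuda.device_count())     # 1 -- only the masked-in GPU is visible
print(torch.cuda.current_device())   # 0 -- the visible GPU is renumbered as cuda:0

# anything created after this point (models, tensors, the Trainer) defaults to the single visible GPU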
Thank you, @josejames00, you saved my day with this solution.
You are my hero!