Setting specific device for Trainer

Hi, I’m trying to fine-tune a model with Trainer in transformers.

I want to use a specific GPU on my server. The server has two GPUs (index 0 and index 1), and I want to train my model on GPU index 1.

I’ve read the Trainer and TrainingArguments documentation, and I’ve already tried setting CUDA_VISIBLE_DEVICES, but it didn’t work for me.

import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

(I did this in Jupyter, before importing any libraries.)
It raised a runtime error when the Trainer reached the self.model = model.to(args.device) line:
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable.

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset)

I’ve also tried torch.cuda.set_device(1), but that didn’t work either.
I don’t know how to set this up; it seems there is no option for it in the class arguments.
Please help me handle this problem.
Thank you.
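For reference, a minimal sketch of the ordering that the env-var approach depends on (the variables are the ones from the post; the key point is that they must be set before torch or transformers is first imported anywhere in the process, which in a notebook means before any earlier cell imported them):

```python
import os

# These must run before torch (or anything that imports torch) initializes CUDA.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"   # order GPUs to match nvidia-smi numbering
os.environ["CUDA_VISIBLE_DEVICES"] = "1"         # expose only physical GPU 1

# Only now import torch / transformers. With one visible device, that GPU
# appears inside the process as cuda:0, so code defaulting to cuda:0
# actually runs on physical GPU 1.
```

If torch was already imported in an earlier cell, these assignments come too late and CUDA will still see both devices; restarting the kernel is the usual fix.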

If torch.cuda.set_device(1) doesn’t work, the problem is in your install. Does the command nvidia-smi show two GPUs?


Yes.
My code is here:

>>> torch.cuda.set_device(1)
>>> torch.cuda.current_device()
1

epochs = 3
training_args = TrainingArguments(
    do_predict=True,
    output_dir='./results',
    overwrite_output_dir=True,
    do_train=True,
    num_train_epochs=epochs,
    per_device_train_batch_size=190,
    logging_steps=10,
    learning_rate=5e-05,
    warmup_steps=500,
    save_total_limit=100,
    logging_dir='./logs',
    save_steps=50)

training_args.device

The output is:

 device(type='cuda', index=0)

and

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset)

!nvidia-smi


I can see two GPUs,
and nvidia-smi shows that the process is still running on GPU #0.
It seems that torch.cuda.set_device(1) doesn’t work at all.

Mmmm, looking at the code, I can see that Trainer does indeed ignore the default device. Will fix that soon.
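Until that fix lands, one workaround (a sketch, not part of the Trainer API) is to leave the training script unchanged and restrict GPU visibility when launching the process, so the Trainer’s default cuda:0 maps to physical GPU 1. The script name train.py below is hypothetical:

```python
import os
import subprocess

# Build an environment where only physical GPU 1 is visible to the child process.
env = {**os.environ,
       "CUDA_DEVICE_ORDER": "PCI_BUS_ID",
       "CUDA_VISIBLE_DEVICES": "1"}

# Inside the child process, training_args.device will report cuda:0,
# but that index now refers to physical GPU 1.
# subprocess.run(["python", "train.py"], env=env)
```

This sidesteps the issue entirely because the process never sees GPU 0, so there is no wrong device for Trainer to pick.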


I had a discussion with a colleague, and he suggested reinstalling the GPU driver or restarting the server.
I’ll try it and let you know what happens.
Thanks a lot.


Has this been implemented? I couldn’t figure out how to set device 1 for training; I tried the methods above.


Hi, I’m experiencing the same issue when using SetFit. Is there any solution?

Hi @sgugger,
Did you fix this bug?
Today I ran into the same issue: when I set the visible GPU to 1, GPU 0 is still used.

import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"


Note: I’m using the latest transformers module with Python 3.7.
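One thing worth checking in a notebook (a sketch of the usual pitfall, not a transformers API): if torch was imported in any earlier cell, setting CUDA_VISIBLE_DEVICES afterwards is too late, because CUDA may already have been initialized with both GPUs visible.

```python
import os
import sys

# If torch is already loaded in this process, the env var below may have no
# effect; restart the kernel and set it in the very first cell instead.
if "torch" in sys.modules:
    print("torch already imported; restart the kernel before setting CUDA_VISIBLE_DEVICES")
else:
    os.environ["CUDA_VISIBLE_DEVICES"] = "1"
```

This check is a quick way to tell whether the env-var approach had any chance of working in the current kernel.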

I’m having a similar issue. Did you manage to get it to work?