Setting a specific device for Trainer

Hi, I’m trying to fine-tune a model with Trainer in transformers.

I want to train on a specific GPU on my server. The server has two GPUs (index 0 and index 1), and I want to train my model on GPU index 1.

I’ve read the Trainer and TrainingArguments documentation, and I’ve already tried setting CUDA_VISIBLE_DEVICES, but it didn’t work for me:

import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

(I did this in Jupyter, before importing any libraries.)
It raised a runtime error when the Trainer reached the self.model = model.to(args.device) line:
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset)

I’ve also tried torch.cuda.set_device(1), but it didn’t work either.
I don’t know how to set this up; there doesn’t seem to be a device option among the class arguments.
Please help me with this problem.
Thank you.
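
A quick sanity check for this setup, as a sketch (it assumes nothing else in the notebook has touched CUDA yet): if any earlier cell already initialized CUDA, the CUDA_VISIBLE_DEVICES mask is silently ignored, so it is worth verifying the mask actually took effect.

import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch

# If something initialized CUDA before this cell, the mask above has no effect.
print(torch.cuda.is_initialized())  # should still be False at this point
print(torch.cuda.device_count())    # should be 1 (the masked-in GPU), not 2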


If torch.cuda.set_device(1) doesn’t work, the problem is in your install. Does the command nvidia-smi show two GPUs?
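
You can also enumerate the devices from Python itself rather than the shell; a minimal sketch:

import torch

# Lists every GPU visible to this process (after any CUDA_VISIBLE_DEVICES mask).
print(torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))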


Yes, it does.
Here is my code:

[In]: torch.cuda.set_device(1)
[In]: torch.cuda.current_device()
[Out]: 1

epochs = 3
training_args = TrainingArguments(
    do_predict=True,
    output_dir='./results',
    overwrite_output_dir=True,
    do_train=True,
    num_train_epochs=epochs,
    per_device_train_batch_size=190,
    logging_steps=10,
    learning_rate=5e-05,
    warmup_steps=500,
    save_total_limit=100,
    logging_dir='./logs',
    save_steps=50)

training_args.device

The output is:

 device(type='cuda', index=0)

and

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset)

!nvidia-smi


I can see two GPUs, but nvidia-smi shows that the process is still running on GPU #0.
It seems that torch.cuda.set_device(1) doesn’t work at all.


Mmmm, looking at the code, I can see that Trainer does indeed ignore the default device. Will fix that soon.


I discussed this with my mate; he suggested reinstalling the GPU driver or restarting the server.
I’ll try it and let you know what happens.
Thanks a lot.


Is this implemented yet? I could not figure out how to set device 1 for training; I tried the methods above.


Hi, I am experiencing the same issue when using SetFit. Is there any solution?

Hi @sgugger,
Did you fix this bug?
Today I ran into the same issue: I set the visible GPU to 1, but GPU 0 is still used.

import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"


Note: I’m using the latest transformers module with Python 3.7.

I am having a similar issue. Did you manage to get it to work?

Same issue. Why can we specify device_map when loading a model with .from_pretrained, but not when defining a Trainer?
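
For the loading side, a minimal sketch of device_map placing the whole model on one GPU (the model id "gpt2" is only a placeholder, and device_map support requires the accelerate package to be installed):

from transformers import AutoModelForCausalLM

# Map the root module ("") to cuda:1, so the whole model lands on GPU 1.
model = AutoModelForCausalLM.from_pretrained("gpt2", device_map={"": 1})
print(next(model.parameters()).device)  # cuda:1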

This issue still exists. Thanks.

In the meantime, this might do the trick:

trainer = Trainer(
    model = model,
    .
    .
    device=0 # id of the device to use (int) -> for GPU(s): 0, 1 or for CPU: -1
)

Sorry to bother you, but what do you mean here?
Trainer doesn’t have a device parameter.


Is there any solution to this?

I still can’t get Trainer to use a particular device in transformers v4.30.2, even after setting the environment variables:

import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]= "2"

It still takes GPU device 0:

[In]: trainer.args.device
[Out]: device(type='cuda', index=0)

The same issue still exists; I’m literally dealing with this right now.

Really? I just did

import torch
torch.cuda.set_device(1)

and it works flawlessly. It doesn’t fix the fact that the models are pretty big relative to my GPU memory, so I have to use quite a small batch size :smile:

Using torch==2.0.1 and the SetFit library.


It still doesn’t work. Any updates? Thanks!

pyNVML is kind of terrible, but if you can’t access the GPU through pyNVML, then the problem is with your Python/Jupyter setup, not with torch or any of the libraries built on it. !nvidia-smi is not enough to verify that, since the command is sent directly to bash.


EDIT: But it sounds like the GPU is busy. Are you sure you don’t have another Jupyter session running? Even if it has finished training, it won’t release the GPU until you shut down the kernel.
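
For reference, a minimal pyNVML enumeration sketch (assuming the pynvml package is installed; nvmlDeviceGetName returns bytes on older releases):

import pynvml

pynvml.nvmlInit()
count = pynvml.nvmlDeviceGetCount()
print(f"{count} GPU(s) visible to NVML")
for i in range(count):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    name = pynvml.nvmlDeviceGetName(handle)  # bytes on older pynvml releases
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(i, name, f"{mem.used / 2**20:.0f} / {mem.total / 2**20:.0f} MiB used")
pynvml.nvmlShutdown()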

A solution that works for me:

import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

But set these FIRST, before importing anything from the PyTorch or transformers libraries.
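
Put together, a minimal end-to-end sketch of that ordering (choosing physical GPU 1, as in the original question):

# These lines must run before torch or transformers is imported anywhere in the process.
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # physical GPU 1 becomes cuda:0

import torch
from transformers import TrainingArguments

print(torch.cuda.device_count())  # 1 -- only the masked-in GPU is visible
args = TrainingArguments(output_dir="./results")
print(args.device)  # device(type='cuda', index=0), i.e. physical GPU 1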


Thanks @josejames00, your solution works for me.