Setting a specific device for Trainer

Hi, I'm trying to fine-tune a model with the Trainer in transformers.

I want to use a specific GPU on my server. The server has two GPUs (index 0 and index 1), and I want to train my model on GPU index 1.

I've read the Trainer and TrainingArguments documentation, and I've already tried the CUDA_VISIBLE_DEVICES approach, but it didn't work for me.

import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

(I set these in Jupyter, before importing any libraries.)
It gave me a runtime error when the Trainer reached its self.model = model.to(args.device) line, and the error says:
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset)
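
For what it's worth, my understanding of CUDA_VISIBLE_DEVICES is that the remaining GPU is renumbered to index 0 inside the process, so I expected a check like this to pass (a sketch, assuming the env vars are set before torch is imported):

import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch
print(torch.cuda.device_count())    # expected: 1
print(torch.cuda.current_device())  # expected: 0 -- the only visible GPU (physical GPU 1)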

I've also tried torch.cuda.set_device(1), but that didn't work either.
I don't know how to set this up; there doesn't seem to be any option for it in the class arguments.
Please help me handle this problem.
Thank you.

1 Like

If torch.cuda.set_device(1) doesn't work, the problem is in your install. Does the command nvidia-smi show two GPUs?
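
You can also check this from Python itself (a quick sketch):

import torch
print(torch.cuda.is_available())   # should be True
print(torch.cuda.device_count())   # should be 2 on a two-GPU machine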

1 Like

Yes.
Here is my code:

torch.cuda.set_device(1)
torch.cuda.current_device()  # returns 1

epochs = 3
training_args = TrainingArguments(
    do_predict=True,
    output_dir='./results',
    overwrite_output_dir=True,
    do_train=True,
    num_train_epochs=epochs,
    per_device_train_batch_size=190,
    logging_steps=10,
    learning_rate=5e-05,
    warmup_steps=500,
    save_total_limit=100,
    logging_dir='./logs',
    save_steps=50)

training_args.device

The output is:

 device(type='cuda', index=0)

and

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset)

!nvidia-smi

I can see two GPUs, but nvidia-smi shows that the process is still running on GPU #0.
It seems that torch.cuda.set_device(1) doesn't work at all.

1 Like

Mmmm, looking at the code, I can see that Trainer does indeed ignore the default device. Will fix that soon.

4 Likes

I discussed this with a colleague; he suggested reinstalling the GPU driver or restarting the server.
I'll try it and let you know what happens.
Thanks a lot.

1 Like

Has this been implemented? I could not figure out how to set device 1 for training; I tried the methods above.

2 Likes

Hi, I am experiencing the same issue when using SetFit. Is there any solution?

Hi @sgugger,
Did you fix this bug?
Today I faced the same issue: when I set the visible GPU to 1, GPU 0 is still used.

import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"


Note: I'm using the latest transformers module with Python 3.7.

I am having a similar issue. Did you manage to get it to work?

Same issue. Why is it that when loading a model using .from_pretrained we can specify device_map, but we can't specify device_map when defining a Trainer?
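
For anyone unfamiliar, this is the kind of load-time placement I mean (a sketch; "gpt2" is just a placeholder model, and the dict form of device_map needs accelerate installed):

from transformers import AutoModelForCausalLM

# {"": 1} maps the whole model (the root module "") onto GPU 1 at load time
model = AutoModelForCausalLM.from_pretrained("gpt2", device_map={"": 1})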

This issue still exists. Thanks.

Meanwhile, this can do the trick:

trainer = Trainer(
    model=model,
    ...
    device=0  # id of the device to use (int) -> for GPU(s): 0, 1; for CPU: -1
)

Sorry to bother you, but what do you mean here?
Trainer doesn't have a device parameter.
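
You can confirm this yourself with a quick check (a sketch using inspect from the standard library):

import inspect
from transformers import Trainer

# 'device' does not appear among Trainer's constructor parameters
print(sorted(inspect.signature(Trainer.__init__).parameters))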

1 Like

Is there any solution to this?

Still can't get Trainer to use a particular device in transformers v4.30.2, even after setting the environment variables:

import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]= "2"

It still uses GPU device 0:

[In]: trainer.args.device
[Out]: device(type='cuda', index=0)

The same issue still exists; I'm literally dealing with this right now.

Really? I just did

import torch
torch.cuda.set_device(1)

and it works flawlessly. It doesn't fix the fact that the models are pretty big relative to my GPU memory, so I have to use quite a small batch size :smile:

Using torch==2.0.1 and the SetFit library.

1 Like

It still doesn’t work. Any updates? Thanks!

pyNVML is kind of terrible, but if you can't access the GPU using pyNVML, then the problem is with your Python/Jupyter setup, not torch or any of the libraries that build on it. !nvidia-smi is not enough to verify that, since the command is sent directly to bash.

Here is the website.

EDIT: But it sounds like the GPU is busy. Are you sure you don't have another Jupyter session running? Even if it is done training, it won't release the GPU until you shut down the kernel.
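
A minimal sketch of such a check (assuming the pynvml package is installed, e.g. via pip install nvidia-ml-py; note NVML enumerates physical GPUs and ignores CUDA_VISIBLE_DEVICES):

import pynvml

pynvml.nvmlInit()
count = pynvml.nvmlDeviceGetCount()
print(f"NVML sees {count} GPU(s)")
for i in range(count):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(i, pynvml.nvmlDeviceGetName(handle), f"{mem.used / 1024**2:.0f} MiB used")
pynvml.nvmlShutdown()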

A solution that works for me:

import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

But run these lines FIRST, before importing anything from the PyTorch or transformers libraries.
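
Putting it all together, a minimal sketch of the ordering (the "1" below assumes you want the second physical GPU; inside the process it will then show up as cuda:0):

import os

# Must run before torch/transformers are imported anywhere in the process,
# because device enumeration is fixed when CUDA is first initialized.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

from transformers import TrainingArguments

args = TrainingArguments(output_dir="./results")
print(args.device)  # device(type='cuda', index=0) -- i.e. physical GPU 1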

2 Likes

Thanks @josejames00, your solution works for me.