Hi, I’m trying to fine-tune a model with Trainer in transformers, and I want to use a specific GPU on my server.
My server has two GPUs (index 0 and index 1), and I want to train my model on GPU index 1.
I’ve read the Trainer and TrainingArguments docs, and I’ve already tried the CUDA_VISIBLE_DEVICES approach, but it didn’t work for me:
import os

# Enumerate GPUs in PCI bus order, then expose only physical GPU 1 to CUDA
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
(I ran this in Jupyter, before importing any of the other libraries.)
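To sanity-check whether the masking took effect, I tried something like this (just my own rough check, not sure it’s the intended way):

import torch

# If CUDA_VISIBLE_DEVICES="1" took effect, torch should see exactly one
# device, and cuda:0 inside this process maps to physical GPU 1.
print(torch.cuda.device_count())      # expecting 1
print(torch.cuda.get_device_name(0))  # should be the name of GPU 1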
The Trainer then raised a runtime error when it reached the self.model = model.to(args.device) line:
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
I also tried torch.cuda.set_device(1), but that didn’t work either.
I don’t know how to set this up. It doesn’t look like TrainingArguments has an option for picking which GPU to use.
Please help me handle this problem.
Thank you.
pyNVML is kinda terrible, but if you can’t access the GPU through pyNVML, then the problem is with your Python / Jupyter setup, not torch or any of the libraries built on top of it. !nvidia-smi is not enough to verify that, since that command is just sent straight to bash, outside your Python process.
Here is the pyNVML website.
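Something like this (a rough sketch, untested on your machine) should tell you whether your process can see the cards at all:

import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    # nvmlDeviceGetName may return bytes or str depending on the pynvml version
    print(i, pynvml.nvmlDeviceGetName(handle), "used:", mem.used, "total:", mem.total)
pynvml.nvmlShutdown()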
EDIT: But it sounds like the GPU is busy. Are you sure you don’t have another Jupyter session running? Even if it’s done training, it won’t release the GPU until you shut down the kernel.
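If you want to see what’s actually holding the GPU, a rough sketch (my own, again via pyNVML):

import pynvml

pynvml.nvmlInit()
# Index 1 here assumes NVML enumerates the cards in the same order as nvidia-smi
handle = pynvml.nvmlDeviceGetHandleByIndex(1)
# Lists compute processes currently holding memory on that GPU
for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
    print("pid:", proc.pid, "memory:", proc.usedGpuMemory)
pynvml.nvmlShutdown()

If a stale kernel shows up in that list, killing it (or shutting the kernel down from Jupyter) should free the device.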