How to use specific GPUs with Accelerator to train a model?

I'm training my own prompt-tuning model using the transformers package, following the training framework in the official example. My training environment is a single machine with multiple GPUs: the machine has 8 GPU cards and I only want to use some of them. However, the Accelerator fails to work properly. It just puts everything on gpu:0, so I cannot use multiple GPUs. Setting os.environ['CUDA_VISIBLE_DEVICES'] also fails to work.
I have re-written the code without Accelerator, using nn.DataParallel together with os.environ['CUDA_VISIBLE_DEVICES'] to specify the GPUs. Everything works fine in this case.
So what's the reason? According to the manual, I think Accelerator should be able to take care of all of this. Thank you so much for your help!
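
For reference, a minimal sketch of the workaround described above, with a tiny linear layer standing in for the actual prompt-tuning model:

import os
# Must be set before torch initializes CUDA for the masking to take effect.
os.environ['CUDA_VISIBLE_DEVICES'] = '3,4,5,6'

import torch
from torch import nn

# Trivial stand-in for the actual prompt-tuning model.
model = nn.Linear(128, 128)
# DataParallel replicates the module across all visible GPUs (physical cards
# 3-6, remapped to cuda:0..cuda:3) and splits each input batch among them.
model = nn.DataParallel(model).cuda()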

FYI, here is the version information:
python 3.6.8
transformers 3.4.0
accelerate 0.5.1
NVIDIA GPU cluster

Accelerator does not use DataParallel on purpose, since it's not recommended by PyTorch. Have you properly set up your config with accelerate config and launched your script with accelerate launch?
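
For example (the script name is a placeholder):

accelerate config                     # answer the prompts once to create a config file
accelerate launch training_script.py  # runs the script with that saved config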

Alternatively, did you launch your script with python -m torch.distributed.launch ...? See more commands here.

Thanks for your reply! I tried accelerate config, but I couldn't find a place to specify which GPU cards to use. For example, if I set nproc_per_node to 4, it will automatically use gpu:0, gpu:1, gpu:2, gpu:3 on my machine. Is there a way to change this behavior?
Thank you so much~

No, you will also need to add CUDA_VISIBLE_DEVICES="0,1,2,3" when launching to use those four GPUs.

Yes, I actually did this by setting os.environ['CUDA_VISIBLE_DEVICES'] = "3,4,5,6" at the beginning of my code, but it doesn't work. Did I miss anything?
Thank you for your help!

No, it needs to be set before the launch command, so that the processes spawned by the launcher inherit it:

CUDA_VISIBLE_DEVICES="3,4,5,6" accelerate launch training_script.py


Still fails to work correctly :no_mouth:

Why do you say that? It seems good to me.

Oh, sorry. I just checked the GPU state. It's great. I just stupidly thought the device should show cuda:3/4/5/6 (it shouldn't, of course, since only 4 GPUs are visible).
Thank you so much for your quick reply. Your help really saved me, since it's my first time using the accelerate package.

Yes, you can't completely trust the printed devices :slight_smile:

Sir, I have this error. Can you please suggest a solution to it?
GPUAccelerator can not run on your system since the accelerator is not available. The following accelerator(s) is available and can be passed into accelerator argument of Trainer: ['cpu'].
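
Note: that message looks like it comes from PyTorch Lightning's Trainer, and it usually means PyTorch cannot see any CUDA device (for example, a CPU-only torch build or a driver issue). A quick sanity check:

import torch

print(torch.cuda.is_available())  # False means torch cannot see a CUDA device
print(torch.cuda.device_count())  # number of GPUs torch can see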

Hi, I also have the same error. Have you found a solution? Hoping for your reply!

Hi,

What the second reply above says works for me. A workable command looks like this:

CUDA_VISIBLE_DEVICES={GPUs you want to use} python -m torch.distributed.launch --nproc_per_node={number of GPUs used} \
  your_python_script.py {other arguments for your python script}

Sorry for the late reply, hope you have solved the problem.

accelerate launch now also lets you specify --gpu_ids as a string :slight_smile:
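
For example, something like this should pick the same four cards as above (script name is a placeholder):

accelerate launch --gpu_ids "3,4,5,6" training_script.py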

Why don't you just make our lives easier and simply add a parameter to the Trainer that takes the GPU id or a list of ids?