I’m training my own prompt-tuning model using the transformers package, following the training framework in the official example. My training environment is a single machine with multiple GPUs: the machine has 8 GPU cards and I only want to use some of them. However, the Accelerator fails to work properly. It just puts everything on gpu:0, so I cannot use multiple GPUs. Also, os.environ['CUDA_VISIBLE_DEVICES'] fails to work.
I have rewritten the code without using Accelerator. Instead, I use os.environ['CUDA_VISIBLE_DEVICES'] to specify the GPUs. Everything works fine in this case.
So what’s the reason? According to the manual, I think Accelerator should be able to take care of all these things. Thank you so much for your help!
FYI, here is the version information:
NVIDIA gpu cluster
Accelerator does not use DataParallel on purpose, since it’s not recommended by PyTorch. Have you properly set up your config with accelerate config and launched your script with accelerate launch?
Alternatively, did you launch your script with python -m torch.distributed.launch ...? See more commands here.
Thanks for your reply! I tried to use accelerate config, but I haven’t found a place to specify the GPU cards that I want to use. For example, if I set nproc_per_node to 4, it will automatically use gpu:0, gpu:1, gpu:2, gpu:3 on my machine. Is there a way to change this behavior?
Thank you so much~
No, you will also need to add CUDA_VISIBLE_DEVICES="0,1,2,3" when launching to use those four GPUs.
Yes, I actually did this by setting os.environ['CUDA_VISIBLE_DEVICES'] = "3,4,5,6" at the beginning of my code. But it doesn’t work. Did I miss anything?
Thank you for your help!
No, it needs to be done before the launching command:
CUDA_VISIBLE_DEVICES="3,4,5,6" accelerate launch training_script.py
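If you would rather start the run from Python than from the shell, here is a minimal sketch of the same idea: the variable is placed in the environment of the launcher process, so every worker it spawns inherits it. The helper names launch_env and launch_on_gpus are made up for illustration, and the sketch assumes the accelerate command is on your PATH.

```python
import os
import subprocess


def launch_env(gpu_ids, base_env=None):
    """Return a copy of the environment restricted to the given GPU ids."""
    env = dict(os.environ if base_env is None else base_env)
    # Comma-separated, no spaces: e.g. "3,4,5,6"
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


def launch_on_gpus(gpu_ids, script="training_script.py"):
    """Hypothetical wrapper, roughly equivalent to running
    CUDA_VISIBLE_DEVICES="3,4,5,6" accelerate launch training_script.py
    from the shell."""
    return subprocess.run(
        ["accelerate", "launch", script],
        env=launch_env(gpu_ids),
        check=True,
    )
```

Calling launch_on_gpus([3, 4, 5, 6]) would then start the launcher with only those four cards visible.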
Still fails to work correctly
Why do you say that? It seems good to me.
Ouch, sorry. I just checked the GPU state. It’s great. I just stupidly thought the Device should show cuda:3/4/5/6 (it shouldn’t, of course, since only 4 GPUs are visible).
Thank you so much for your quick reply. Your help really saved me, since it’s my first time using
Yes, you can’t completely trust the devices printed
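To make the renumbering concrete, here is a small sketch of how the device names seen inside the process line up with the ids listed in CUDA_VISIBLE_DEVICES. The function name visible_to_physical is hypothetical, and the sketch assumes the variable holds plain integer indices (not GPU UUIDs).

```python
def visible_to_physical(visible_devices):
    """Map the device names a process sees to the ids it was given.

    Assumes a comma-separated list of integer indices, e.g. "3,4,5,6".
    """
    ids = [int(part) for part in visible_devices.split(",") if part.strip()]
    # Inside the process, devices are always renumbered from zero.
    return {f"cuda:{logical}": physical for logical, physical in enumerate(ids)}


mapping = visible_to_physical("3,4,5,6")
for name, physical in mapping.items():
    print(f"{name} -> GPU {physical}")
```

So with "3,4,5,6" the script reports cuda:0 through cuda:3, and checking actual usage with a tool such as nvidia-smi is the more reliable way to confirm which cards are busy.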
Sir, I am getting this error. Can you please suggest a solution?
GPUAccelerator can not run on your system since the accelerator is not available. The following accelerator(s) is available and can be passed into
accelerator argument of