Trainer.train() hangs with multiple GPUs (but GPUs show activity)

I’m working through the Hugging Face course, specifically the “Training a causal language model from scratch” section.

That page says “If you have access to a machine with multiple GPUs, try to run the code there. The Trainer automatically manages multiple machines, and this can speed up training tremendously.”

I’m working on a machine with 8 GPUs, and unfortunately it doesn’t seem to work for me. When I set the CUDA_VISIBLE_DEVICES environment variable to “0”, it works perfectly fine on one GPU. However, if I don’t set that environment variable, or set it to more than one GPU, the following happens (a stripped-down sketch of how I’m selecting GPUs is below the list):

  • All the relevant GPUs specified in CUDA_VISIBLE_DEVICES seem to fire at 100% utilization
  • The tqdm progress bar in my script doesn’t move, even after hours (it moves quite quickly with one GPU)
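
For reference, here’s roughly what I mean by setting the variable (simplified from my actual script, which also builds the tokenizer, dataset, and Trainer):

```python
# Only the GPU-selection part of my script is shown here.
import os

# With this line the run trains normally on one GPU. Commenting it out
# (so all 8 GPUs are visible) is what produces the hang described above.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# The variable has to be set before torch initializes CUDA to take effect.
import torch
print(torch.cuda.device_count())  # 1 with the line above, 8 without it
```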

So it seems like the jobs are correctly sent to the GPUs (I imagine…) but the central process isn’t… receiving the answers?

I assume there’s some additional config I need, but I can’t for the life of me find anything in the docs that says what. I tried running accelerate config on the command line, but that didn’t help.

Also, when I print trainer.args.parallel_mode right before training, I get ParallelMode.NOT_DISTRIBUTED.
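
In case it helps, this is roughly the check I’m doing right before calling trainer.train() (my real TrainingArguments are longer; “out” is just a placeholder output directory):

```python
import torch
from transformers import TrainingArguments

args = TrainingArguments(output_dir="out")  # placeholder; my real args are longer

print(torch.cuda.device_count())  # 8 on this machine
print(args.n_gpu)                 # how many GPUs the Trainer will try to use
print(args.parallel_mode)         # this is what prints ParallelMode.NOT_DISTRIBUTED
```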

Thanks a million for the help!!

Did you find an answer?

There seem to be quite a few reported problems with the Trainer when using multiple GPUs:
https://discuss.huggingface.co/search?q=multi%20gpu%20trainer%20order%3Alatest

Unfortunately I never did - apologies :frowning: I ended up moving to another server with 8 GPUs, and it magically worked, so I can only assume it was something to do with the way the 8 GPUs were set up on the first machine.

So it was some kind of hardware or setup problem with that machine, then? Anyway, I’m glad there was a workaround.