Trainer.train() hangs with multiple GPUs (but GPUs show activity)

I’m working through the Hugging Face tutorials, specifically the “Training a causal language model from scratch” section.

That page says “If you have access to a machine with multiple GPUs, try to run the code there. The Trainer automatically manages multiple machines, and this can speed up training tremendously.”

I’m working on a machine with 8 GPUs, and unfortunately it doesn’t seem to work for me. When I set the CUDA_VISIBLE_DEVICES environment variable to “0”, training works perfectly fine on one GPU. However, if I don’t set that environment variable, or set it to more than one GPU, the following happens:

  • All the GPUs specified in CUDA_VISIBLE_DEVICES seem to sit at 100% utilization
  • The tqdm progress bar in my script doesn’t move, even after hours (it moves quite quickly with one GPU)

So it seems like the jobs are correctly sent to the GPUs (I imagine…) but the central process isn’t… receiving the answers?
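
For reference, here’s a stripped-down sketch of roughly what my script does (the model and dataset here are placeholders, not my exact setup):

```python
# Stripped-down sketch of my training script (placeholder model/dataset,
# not my exact code)
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="clm-test", per_device_train_batch_size=8),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # hangs when more than one GPU is visible
```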

I assume there’s some additional configuration I need, but I can’t for the life of me find anything in the docs that says what. I tried running accelerate config on the command line, but no luck.
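
If it helps with diagnosis, here’s a small check I could drop in right before trainer.train() (just a sketch; RANK and WORLD_SIZE are the environment variables that torchrun / accelerate launch normally set):

```python
import os
import torch

# Quick diagnostic: how many GPUs does torch see, and is any distributed
# environment actually set up? RANK / WORLD_SIZE get set by torchrun or
# `accelerate launch`, but not by a plain `python train.py` invocation.
print("visible GPUs:", torch.cuda.device_count())
print("RANK:", os.environ.get("RANK"))
print("WORLD_SIZE:", os.environ.get("WORLD_SIZE"))
print("dist initialized:", torch.distributed.is_available() and torch.distributed.is_initialized())
```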

Also, when I print trainer.args.parallel_mode right before training, I get ParallelMode.NOT_DISTRIBUTED.
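
In case the exact check matters, it’s essentially this (standalone version; output_dir is just a placeholder):

```python
from transformers import TrainingArguments

# Even with all 8 GPUs visible, this reports NOT_DISTRIBUTED when the script
# is launched with plain `python`.
args = TrainingArguments(output_dir="clm-test")
print(args.parallel_mode)  # ParallelMode.NOT_DISTRIBUTED
print(args.n_gpu)          # number of GPUs the Trainer will use
```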

Thanks a million for the help!!