Trainer.train() hangs with multiple GPUs (but GPUs show activity)

I’m working through the Hugging Face tutorials, specifically the “Training a causal language model from scratch” section.

That page says “If you have access to a machine with multiple GPUs, try to run the code there. The Trainer automatically manages multiple machines, and this can speed up training tremendously.”

I’m working on a machine with 8 GPUs, and unfortunately it doesn’t seem to work for me. When I set the CUDA_VISIBLE_DEVICES environment variable to “0”, training works perfectly fine on one GPU. However, if I don’t set that environment variable, or set it to more than one GPU, the following happens:

  • All the GPUs specified in CUDA_VISIBLE_DEVICES seem to sit at 100% utilization
  • The tqdm progress bar in my script doesn’t move, even after hours (it moves quite quickly with one GPU)

So it seems like the jobs are correctly sent to the GPUs (I imagine…) but the central process isn’t… receiving the answers?
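
For reference, here’s a stripped-down sketch of roughly what my script does (the model and dataset here are placeholders, not my exact setup):

```python
# Stripped-down sketch of my training script (placeholder model/dataset,
# not my exact code)
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="clm-test", per_device_train_batch_size=8),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # hangs when more than one GPU is visible
```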

I assume there’s some additional configuration I need, but I can’t for the life of me find anything in the docs that says what. I tried running accelerate config on the command line, but no luck.
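
If it helps with diagnosis, here’s a small check I could drop in right before trainer.train() (just a sketch; RANK and WORLD_SIZE are the environment variables that torchrun / accelerate launch normally set):

```python
import os
import torch

# Quick diagnostic: how many GPUs does torch see, and is any distributed
# environment actually set up? RANK / WORLD_SIZE get set by torchrun or
# `accelerate launch`, but not by a plain `python train.py` invocation.
print("visible GPUs:", torch.cuda.device_count())
print("RANK:", os.environ.get("RANK"))
print("WORLD_SIZE:", os.environ.get("WORLD_SIZE"))
print("dist initialized:", torch.distributed.is_available() and torch.distributed.is_initialized())
```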

Also, when I print trainer.args.parallel_mode right before training, I get ParallelMode.NOT_DISTRIBUTED.
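
In case the exact check matters, it’s essentially this (standalone version; output_dir is just a placeholder):

```python
from transformers import TrainingArguments

# Even with all 8 GPUs visible, this reports NOT_DISTRIBUTED when the script
# is launched with plain `python`.
args = TrainingArguments(output_dir="clm-test")
print(args.parallel_mode)  # ParallelMode.NOT_DISTRIBUTED
print(args.n_gpu)          # number of GPUs the Trainer will use
```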

Thanks a million for the help!!