I’m getting annoying crashes when I try to train a RoBERTa model with two Titan X GPUs. The documentation says the model should train on multi-GPU automatically, and nvidia-smi shows both GPUs in use, but I don’t see any progress and the session freezes. Any suggestions would be most helpful.
What do you mean by freeze? What do you see in the terminal?
Also show us which command you used to train the model.
I’m in JupyterLab and just get an empty progress bar. On a single GPU this test example takes seconds, but with multi-GPU I wait minutes with no update. I have to restart the kernel to get any response.
I’m calling trainer.train()
Since you’re using a multi-GPU setup, I don’t think the Trainer will automatically run in distributed mode. It’ll probably run in DataParallel, but most of the time you want the performance gains of DistributedDataParallel. To do that, step away from the notebook and use the launch utility. Your command will look like this:
python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE your-training-script.py
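For context, the launcher spawns one process per GPU and calls each one with a --local_rank argument, which TrainingArguments defines so the Trainer knows it is running distributed. A minimal sketch of how that flag arrives (plain argparse here is illustrative; the HF example scripts use HfArgumentParser, which handles it for you):

```python
# Illustrative sketch (not from the thread): torch.distributed.launch starts
# one process per GPU and invokes each as
#   your-training-script.py --local_rank=<rank>
# so the script must accept that flag.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=-1)

# Without the launcher, nothing passes the flag, so it stays at -1
# and the Trainer falls back to DataParallel on a multi-GPU machine.
args = parser.parse_args([])  # empty list simulates a bare `python script.py`
print(args.local_rank)  # -1
```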
Tagging @sgugger who knows a lot more about the trainer than I do.
Thanks for this, I’ll give it a shot
That is correct. Trainer uses all available GPUs with DataParallel if run without torch.distributed.launch.
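To make the two modes concrete, here is a rough sketch of the selection logic (my own simplification, not the actual Trainer source), keyed on local_rank and the number of visible GPUs:

```python
def parallel_mode(local_rank: int, n_gpu: int) -> str:
    """Simplified sketch of how the Trainer picks a parallelism mode."""
    if local_rank != -1:
        # The launcher set --local_rank, so each process owns one GPU.
        return "DistributedDataParallel"
    if n_gpu > 1:
        # Plain `python script.py` on a multi-GPU box: model is replicated
        # and batches are scattered/gathered from a single process.
        return "DataParallel"
    return "single device"

print(parallel_mode(-1, 2))  # DataParallel
print(parallel_mode(0, 2))   # DistributedDataParallel
```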
I ran into the same issue as the author of the thread. I am using 4 Teslas and trainer.train() runs indefinitely without any output. I specified CUDA_VISIBLE_DEVICES='1,2,3,5', by the way.
When I set this variable to '1', it all works perfectly and the progress is visible in the output.
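One thing worth checking, as an assumption on my side: in a notebook, CUDA_VISIBLE_DEVICES must be set before CUDA is initialized, i.e. before the first import torch, or it is silently ignored for that process:

```python
import os

# Set this before any `import torch` (or export it in the shell that starts
# the kernel); device ids here are the ones from the post above.
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2,3,5"

# From here on, only those four GPUs are visible to the process,
# and CUDA renumbers them 0-3 internally.
```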
Earlier in this thread you discussed DataParallel and DistributedDataParallel. I know the difference, but I can’t tell from the thread whether it is possible to utilize the first approach; I assume it is. What could be the problem here? I am trying the script from sentence-transformers/train_mlm.py at master · UKPLab/sentence-transformers · GitHub