Multi-GPU freezes on RoBERTa pretraining

I’m getting annoying hangs when I try to train a RoBERTa model with two Titan X GPUs. The documentation says the model should train on multiple GPUs automatically, and nvidia-smi shows that both GPUs are in use. But I don’t see any progress and the session freezes. Any suggestions would be most helpful.

What do you mean by freeze? What do you see in the terminal?

Also show us which command you used to train the model.

I’m in JupyterLab and just get an empty progress bar. On a single GPU this test example takes seconds, but with multiple GPUs I wait minutes with no update. I have to restart the kernel to get any response.

I’m using trainer.train().

Considering you’re using a multi-GPU setup, I don’t think the Trainer will automatically run in distributed mode. It will probably run with DataParallel, but most of the time you want the performance gains of DistributedDataParallel. To do so, step away from the notebook and use the launch utility. Your command will look like this:

python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE YOUR_TRAINING_SCRIPT.py (your usual training args)
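One detail worth knowing (a minimal sketch, not something from this thread): torch.distributed.launch spawns one process per GPU and injects a --local_rank argument into each process, so your training script has to accept that flag. The Hugging Face example scripts already handle it via their argument parser; a custom script needs something like:

```python
import argparse


def parse_args(argv=None):
    """Parse the arguments torch.distributed.launch passes to each process."""
    parser = argparse.ArgumentParser()
    # torch.distributed.launch injects --local_rank into every spawned process;
    # -1 is the conventional default for "not running distributed".
    parser.add_argument("--local_rank", type=int, default=-1)
    # ... your usual training arguments go here ...
    return parser.parse_args(argv)


# The launcher effectively invokes your script as:
#   YOUR_TRAINING_SCRIPT.py --local_rank 0   (and --local_rank 1 on the second GPU, etc.)
args = parse_args(["--local_rank", "0"])
print(args.local_rank)  # 0 identifies the process driving GPU 0
```

The Trainer then uses this rank to place the model on the right device and wrap it in DistributedDataParallel.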

Tagging @sgugger who knows a lot more about the trainer than I do.

Thanks for this, I’ll give it a shot.

That is correct. Trainer uses all available GPUs with DataParallel if run without torch.distributed.launch.