I’m getting annoying crashes when I try to train a RoBERTa model with two Titan X GPUs. The documentation says the model should train on multiple GPUs automatically, and nvidia-smi shows that the GPUs are in use. But I don’t see any progress and the session freezes. Any suggestions would be most helpful.
What do you mean by freeze? What do you see in the terminal?
Also show us which command you used to train the model.
I’m in JupyterLab and just get an empty progress bar. On a single GPU this test example takes seconds, but with multiple GPUs I wait minutes with no update. I have to restart the kernel to get any response.
I’m using trainer.train().
Since you’re using a multi-GPU setup, I do not think the Trainer will automatically run in distributed mode. It will probably run in DataParallel, but most of the time you want the performance gains of DistributedDataParallel. To do so, step away from the notebook and use the launch utility. Your command will look like this:
python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE your-training-script.py
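If you want to confirm from inside your script that it really was started by the launcher, something like the sketch below might help. It relies on my understanding that torch.distributed.launch passes a --local_rank argument to each worker and sets WORLD_SIZE (and, with --use_env, LOCAL_RANK) in the environment. The helper name describe_launch is hypothetical, not part of PyTorch or transformers.

```python
import argparse
import os


def describe_launch(argv=None, env=None):
    """Report whether this process looks like it was started by a distributed launcher.

    Hypothetical helper for illustration only. It assumes the launcher either
    passes --local_rank on the command line or sets LOCAL_RANK in the environment.
    """
    env = os.environ if env is None else env

    parser = argparse.ArgumentParser()
    # Default of -1 means "no launcher supplied a local rank".
    parser.add_argument("--local_rank", type=int, default=-1)
    # parse_known_args ignores the script's other arguments instead of failing on them.
    args, _unknown = parser.parse_known_args(argv)

    local_rank = args.local_rank
    if local_rank == -1:
        # Fall back to the environment variable form.
        local_rank = int(env.get("LOCAL_RANK", -1))

    world_size = int(env.get("WORLD_SIZE", 1))

    return {
        "distributed": local_rank != -1,
        "local_rank": local_rank,
        "world_size": world_size,
    }
```

Calling print(describe_launch()) at the top of your-training-script.py should report distributed as False when the script is run with plain python (or from a notebook), and True with one local_rank per process when run through the launch command above.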
Tagging @sgugger who knows a lot more about the trainer than I do.
Thanks for this, I’ll give it a shot.
That is correct.
Trainer uses all available GPUs with
DataParallel if run without