I’m getting annoying crashes when I try to train a RoBERTa model with two Titan X GPUs. The documentation says the model should train on multi-GPU automatically, and nvidia-smi shows both GPUs in use, but I don’t see any progress and the session freezes. Any suggestions would be most helpful.
What do you mean by freeze? What do you see in the terminal?
Also show us which command you used to train the model.
I’m in JupyterLab and just get an empty progress bar. On a single GPU this test example takes seconds, but with multi-GPU I wait minutes with no update. I have to restart the kernel to get any response.
I’m calling trainer.train()
Since you’re using a multi-GPU setup, I don’t think the Trainer will automatically run in distributed mode. It’ll probably run in DataParallel, but most of the time you want the performance gains of DistributedDataParallel. To do that, step away from the notebook and use the launch utility. Your command will look like this:
python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE your-training-script.py
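For context, the launcher spawns one process per GPU and calls each one with a --local_rank argument, which TrainingArguments defines so the Trainer knows it is running distributed. A minimal sketch of how that flag arrives (plain argparse here is illustrative; the HF example scripts use HfArgumentParser, which handles it for you):

```python
# Illustrative sketch (not from the thread): torch.distributed.launch starts
# one process per GPU and invokes each as
#   your-training-script.py --local_rank=<rank>
# so the script must accept that flag.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=-1)

# Without the launcher, nothing passes the flag, so it stays at -1
# and the Trainer falls back to DataParallel on a multi-GPU machine.
args = parser.parse_args([])  # empty list simulates a bare `python script.py`
print(args.local_rank)  # -1
```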
Tagging @sgugger who knows a lot more about the trainer than I do.
Thanks for this, I’ll give it a shot
That is correct. Trainer uses all available GPUs with DataParallel if run without torch.distributed.launch.
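To make the two modes concrete, here is a rough sketch of the selection logic (my own simplification, not the actual Trainer source), keyed on local_rank and the number of visible GPUs:

```python
def parallel_mode(local_rank: int, n_gpu: int) -> str:
    """Simplified sketch of how the Trainer picks a parallelism mode."""
    if local_rank != -1:
        # The launcher set --local_rank, so each process owns one GPU.
        return "DistributedDataParallel"
    if n_gpu > 1:
        # Plain `python script.py` on a multi-GPU box: model is replicated
        # and batches are scattered/gathered from a single process.
        return "DataParallel"
    return "single device"

print(parallel_mode(-1, 2))  # DataParallel
print(parallel_mode(0, 2))   # DistributedDataParallel
```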
I ran into the same issue as the author of the thread. I am using 4 Teslas and trainer.train() runs indefinitely without any output. I specified CUDA_VISIBLE_DEVICES='1,2,3,5', by the way.
When I set this variable to '1', it all works perfectly and the progress is visible in the output.
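One thing worth checking, as an assumption on my side: in a notebook, CUDA_VISIBLE_DEVICES must be set before CUDA is initialized, i.e. before the first import torch, or it is silently ignored for that process:

```python
import os

# Set this before any `import torch` (or export it in the shell that starts
# the kernel); device ids here are the ones from the post above.
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2,3,5"

# From here on, only those four GPUs are visible to the process,
# and CUDA renumbers them 0-3 internally.
```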
Earlier in this thread you discussed DataParallel and DistributedDataParallel. I know the difference, but I can’t tell from the thread whether it is possible to utilize the first approach; I assume it is. What could be the problem here? I am trying the script from sentence-transformers/train_mlm.py at master · UKPLab/sentence-transformers · GitHub