@sgugger I am using the Trainer class but not seeing any major speedup in training on a multi-GPU setup. In nvidia-smi and the W&B dashboard, I can see that both GPUs are being used. I then launched the same training script on a single-GPU machine for comparison; the training commands are identical on both machines.
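For reference, my setup looks roughly like this (a minimal sketch: the model, dataset, and hyperparameters are placeholders, not my exact configuration):

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

dataset = load_dataset("imdb", split="train")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length"),
    batched=True,
)

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=16,  # same value on both machines
    num_train_epochs=3,
)

trainer = Trainer(model=model, args=args, train_dataset=dataset)
trainer.train()  # launched identically on the 1-GPU and 2-GPU machines
```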
I do not see any significant speedup in training. The training takes hours and I didn't wait for it to finish, but the tqdm time estimates are pretty much the same on both machines. tqdm should reflect the actual progress properly, right? Any suggestions for further diagnosis?
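In case it helps, here is the kind of sanity check I can run on both machines to confirm what the Trainer actually sees (a minimal sketch; the batch size is just an example value):

```python
import torch
from transformers import TrainingArguments

args = TrainingArguments(output_dir="out", per_device_train_batch_size=16)

print(torch.cuda.device_count())  # I expect 2 on the multi-GPU machine, 1 on the other
print(args.n_gpu)                 # number of GPUs the Trainer will use
print(args.train_batch_size)      # effective batch size = per_device size * number of GPUs
```

If the effective batch size doubles on the multi-GPU machine, I'd expect roughly half as many steps per epoch there, even if the per-step time shown by tqdm stays about the same.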