Batch sizes / 2 GPUs + Windows 10 = 1 GPU?

Hope you can help. Basically I just need some guidance/reassurance about how batch sizes are calculated when 2 GPUs are installed but, on Windows, I think (!) only 1 GPU is actually being used.


  • Remoting into PC with 2 NVIDIA GPUs running Windows 10

  • (someone else’s machine so no option of installing Linux)

  • I have the GPU enabled version of PyTorch installed

  • Running from the transformers repo

  • I set the parameter per_device_train_batch_size = 4

  • I do a test run with 10,000 training samples


  • “_n_gpu=2” printed at start of run, so script has detected 2 GPUs

  • (I have confirmed this directly in another script using torch.cuda.device_count() )

  • I get the warning: “C:\Users\BrianS.virtualenvs\summarization-DUOCBs9B\lib\site-packages\torch\cuda\ UserWarning: PyTorch is not compiled with NCCL support”

  • (I believe the lack of NCCL support on Windows is the reason why multiple GPU training on Windows is not possible?)

  • I get 1,250 steps per epoch


  • I’m assuming that PyTorch defaults to using just 1 GPU instead of the 2 available, hence the warning? (it certainly runs a lot, lot quicker than on CPU alone)

  • Given 2 GPUs installed, a per-device batch size of 4, and 1,250 steps per epoch, the effective batch size seems to be 8. So is it being automatically adjusted to 2 x 4 = 8 because 2 GPUs are present, even though only 1 GPU is being used? (just checking that a batch of 4 is not being skipped for the GPU that was detected but is not being used?)
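The arithmetic above can be sanity-checked with a tiny sketch (assumption: `steps_per_epoch` is a hypothetical helper that mirrors how the Trainer appears to derive steps per epoch, not an actual transformers function):

```python
import math

# Hypothetical helper mirroring the apparent Trainer behaviour:
# the effective batch size is per_device_train_batch_size * max(1, n_gpu),
# because the batch is scaled by the number of GPUs PyTorch has detected.
def steps_per_epoch(num_samples, per_device_batch_size, n_gpu):
    effective_batch_size = per_device_batch_size * max(1, n_gpu)
    return math.ceil(num_samples / effective_batch_size)

# The numbers from the run above:
print(steps_per_epoch(10_000, 4, 2))  # 1250 -> effective batch size 8
print(steps_per_epoch(10_000, 4, 1))  # 2500 -> effective batch size 4
```

With `n_gpu=2` you get exactly the 1,250 steps observed, which is consistent with an effective batch of 8 rather than a batch of 4 being skipped.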

Many thanks!

I saw this post involving @BramVanroy about setting CUDA_VISIBLE_DEVICES=0 to use just one of the 2 GPUs installed (I assume they are named 0 and 1). But is there any way to verify that only 1 GPU is being used when running the script? And even if so, that doesn’t necessarily clarify how per_device_train_batch_size = 4 is being applied when 2 GPUs are present but, I think (!), only one GPU is being used.

The second GPU has been physically removed, so I am now running with 1 GPU on Windows 10 so that there is no ambiguity about what is going on.

If need be, you can verify how many GPUs torch sees, but CUDA_VISIBLE_DEVICES is what determines which (and how many) are actually usable. With this variable set, the scripts that follow simply only have access to the specified devices; they have no knowledge of any other devices. So on my machine with 4 GPUs I can do the following, and torch will only ever have access to GPU #0.

CUDA_VISIBLE_DEVICES=0 python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
# True 1

but without the variable, torch can see and use all GPUs.

python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
# True 4

The NCCL backend point is important indeed. I am not sure how far along the Windows implementation currently is. If NCCL is not available, that means you cannot use DistributedDataParallel (the recommended approach), but AFAIK you can still use DataParallel.

@BramVanroy thank you very much for the reply. I took the safe (but not too sophisticated) option of simply physically removing the second GPU, so that I know I have one GPU on Windows, which is a setup that should work.

2 GPUs was confusing: there was the NCCL warning (coming indirectly from PyTorch, I assume), but both GPUs had similar memory allocations, and it was unclear what HF was doing with samples per batch per device, where you (I !!!) don’t know whether both GPUs are working or simply duplicating the work.

Anyway, went for the safe option but thanks again!

I understand that you want to be sure, but there is no need to worry. The trainer will probably simply fall back to DataParallel (non-distributed).

This uses two GPUs, but non-distributed. It is less performant than DistributedDataParallel but still significantly faster than a single GPU. What it does is divide the work over two (or more) GPUs by splitting the given batch across the different devices. In the backward pass, the gradients are then summed.
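The split-then-sum behaviour can be illustrated with a framework-free toy sketch (assumptions: a one-parameter linear model y = w*x with squared loss, and plain Python lists standing in for devices; this shows the idea only, not PyTorch's actual DataParallel implementation):

```python
# Gradient of sum((w*x - y)^2) w.r.t. w over one shard of the batch.
def grad_shard(w, xs, ys):
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys))

# One "DataParallel-style" update step:
# 1) split the batch across n_devices, 2) compute per-device gradients,
# 3) sum the gradients (as in the backward pass), 4) apply the update once.
def data_parallel_step(w, xs, ys, n_devices=2, lr=0.01):
    shard = len(xs) // n_devices
    grads = [grad_shard(w, xs[i * shard:(i + 1) * shard],
                        ys[i * shard:(i + 1) * shard])
             for i in range(n_devices)]
    total_grad = sum(grads)
    return w - lr * total_grad

# Splitting across 2 "devices" yields the same update as 1 device,
# which is why DataParallel does not change the effective batch semantics.
xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
assert abs(data_parallel_step(0.0, xs, ys, 2)
           - data_parallel_step(0.0, xs, ys, 1)) < 1e-9
```

The key point for the batch-size question above: each device sees `batch_size / n_devices` samples, but the summed gradient is identical to processing the whole batch on one device.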


Thank you, it is great to have an explanation of what might have been happening. It seemed to be working fine with 2 GPUs, but as discussed, certainty is probably better than speed with risks (for some of us, anyway!)