Batch sizes / 2 GPUs + Windows 10 = 1 GPU?

TheLongSentance · August 19, 2021, 11:45am

Hope you can help. Basically I just need some guidance/reassurance around how batch sizes are calculated when 2 GPUs are installed but I think (!) on Windows only 1 GPU can be/is being used.

Scenario:

Remoting into PC with 2 NVIDIA GPUs running Windows 10
(someone else’s machine so no option of installing Linux)
I have the GPU enabled version of PyTorch installed
Running run_summarization.py from the transformers repo
I set the parameter per_device_train_batch_size = 4
I do a test run with 10,000 training samples

Result:

“_n_gpu=2” printed at start of run, so script has detected 2 GPUs
(I have confirmed this directly in another script using torch.cuda.device_count() )
I get the warning: “C:\Users\BrianS.virtualenvs\summarization-DUOCBs9B\lib\site-packages\torch\cuda\nccl.py:15: UserWarning: PyTorch is not compiled with NCCL support”
(I believe the lack of NCCL support on Windows is the reason why multiple GPU training on Windows is not possible?)
I get 1,250 steps per epoch

Questions:

I am assuming that PyTorch defaults to using just 1 GPU instead of the 2 available, hence the warning? (It certainly runs a lot, lot quicker than on CPU alone.)
With 2 GPUs installed and a per-device batch of 4, getting 1,250 steps for 10,000 samples suggests an effective batch size of 8. So is the batch being automatically adjusted to 2 x 4 = 8 even if only 1 GPU is being used while 2 GPUs are present? (Just checking that a batch of 4 is not being skipped for the other GPU that is detected but not used.)
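
The arithmetic behind the second question can be sketched as follows. This is a minimal illustration, assuming the total batch per step is per_device_train_batch_size multiplied by the number of GPUs counted, with no gradient accumulation and the final partial batch kept. The function name `steps_per_epoch` is made up for this example and is not part of transformers.

```python
import math


def steps_per_epoch(num_samples, per_device_batch_size, n_gpu):
    """Estimate optimizer steps per epoch (hypothetical helper, for illustration)."""
    # Assumption: total batch per step = per-device batch x number of GPUs (at least 1).
    total_batch = per_device_batch_size * max(1, n_gpu)
    # Assumption: the last partial batch is kept, hence the ceiling.
    return math.ceil(num_samples / total_batch)


print(steps_per_epoch(10_000, 4, n_gpu=2))  # 1250
print(steps_per_epoch(10_000, 4, n_gpu=1))  # 2500
```

Under these assumptions, 1,250 steps is what two counted GPUs would give, whereas a single GPU with a batch of 4 would give 2,500 steps.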

Many thanks!

TheLongSentance · August 19, 2021, 11:59am

I saw this post involving @BramVanroy about setting CUDA_VISIBLE_DEVICES=0 to use just one of the 2 GPUs installed (I assume they are named 0 and 1). But is there any way to verify that only 1 GPU is being used when running the script? And even if so, that doesn’t necessarily clarify how per_device_train_batch_size = 4 is applied when 2 GPUs are present but (I think!) only one GPU is being used.

TheLongSentance · August 20, 2021, 1:04pm

The second GPU has been physically removed, so I am now running with 1 GPU on Windows 10 and there is no ambiguity about what is going on.

BramVanroy · August 21, 2021, 2:20pm

If need be, you can verify how many GPUs torch sees, but CUDA_VISIBLE_DEVICES already guarantees which (and how many) devices are usable. With this variable set, the scripts that follow only have access to the specified devices and have no knowledge of any others. So on my machine with 4 GPUs I can do the following and torch will only ever have access to GPU #0.

CUDA_VISIBLE_DEVICES=0 python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
# True 1

but without the variable, torch can see and use all GPUs.

python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
# True 4
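
The inline `VAR=value command` form above is POSIX-shell syntax and does not work in the Windows command prompt. A shell-independent alternative is to set the variable from Python itself. The following is a minimal sketch, assuming the variable is set before CUDA is initialised (doing it before importing torch is the safest way to ensure that); `restrict_visible_gpus` is a hypothetical helper name, not a torch or transformers function.

```python
import os


def restrict_visible_gpus(device_ids):
    """Expose only the given GPU indices to CUDA (hypothetical helper)."""
    value = ",".join(str(i) for i in device_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Set the variable first, then import torch, so CUDA never sees the other devices.
restrict_visible_gpus([0])

try:
    import torch

    # With only device 0 exposed, device_count() should report at most 1.
    print(torch.cuda.is_available(), torch.cuda.device_count())
except ImportError:
    print("torch is not installed; CUDA_VISIBLE_DEVICES =", os.environ["CUDA_VISIBLE_DEVICES"])
```

Printing the device count at the top of a training script in this way is one simple check that only the intended GPU is visible.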

The NCCL backend thing is important indeed. I am not sure how far they are currently with the Windows implementation. If not available, that means you cannot use DistributedDataParallel (recommended) but AFAIK you can still use DataParallel.

TheLongSentance · August 21, 2021, 5:00pm

@BramVanroy thank you very much for the reply. I took the safe (but not too sophisticated) option of simply physically removing the second GPU, so that I know I have one GPU on Windows, which is a setup that should work.

Having 2 GPUs was confusing: there was the NCCL warning (I assume indirectly from PyTorch), yet both GPUs had similar memory allocations, and it was unclear what HF was doing in relation to samples per batch per device, where you (I !!!) don’t know if both GPUs are sharing the work or simply duplicating it.

Anyway, went for the safe option but thanks again!

BramVanroy · August 22, 2021, 10:49am

I understand that you want to be sure, but no need to worry. The trainer will probably simply trigger DataParallel (non-distributed).

https://github.com/huggingface/transformers/blob/143738214cb83e471f3a43652617c8881370342c/src/transformers/trainer.py#L940-L942

# Multi-gpu training (should be after apex fp16 initialization)
if self.args.n_gpu > 1:
    model = nn.DataParallel(model)

This uses both GPUs, but in non-distributed mode. It is less performant than DistributedDataParallel but still significantly faster than a single GPU. It divides the work over two (or more) GPUs by splitting each batch across the devices, so with per_device_train_batch_size = 4 and 2 GPUs each step should process 8 samples, 4 per GPU. In the backward pass the gradients are then summed.
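
The batch splitting described above can be illustrated in plain Python. This is only a sketch of the idea, assuming the batch is cut into contiguous, roughly equal chunks with one chunk per device; it is not PyTorch's actual scatter code, and `split_batch` is a hypothetical name.

```python
import math


def split_batch(batch, num_devices):
    """Split a batch into contiguous chunks, one per device (illustrative only)."""
    chunk_size = max(1, math.ceil(len(batch) / num_devices))
    return [batch[i:i + chunk_size] for i in range(0, len(batch), chunk_size)]


# 8 samples per step, e.g. a per-device batch of 4 with 2 GPUs.
batch = list(range(8))
for device, chunk in enumerate(split_batch(batch, 2)):
    print(f"GPU {device}: {chunk}")
# GPU 0: [0, 1, 2, 3]
# GPU 1: [4, 5, 6, 7]
```

In this picture neither GPU duplicates the other's work and no samples are skipped: each device receives its own 4 samples from the batch of 8.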

TheLongSentance · August 22, 2021, 3:53pm

Thank you, great to have an explanation of what might have been happening. It seemed to be working fine with 2 but, as discussed, certainty is probably better than speed with risks (for some anyway!)
