notebook_launcher with num_processes=2 still prints "Launching training on one GPU." in Kaggle

Indramal · December 8, 2022, 3:00am

I am trying to test this article code with A100 x 2 GPUs. Link - Launching Multi-Node Training from a Jupyter Environment

But it always uses only one GPU in a Kaggle Notebook. How can I solve this issue?

It prints "Launching training on one GPU." even though the notebook has 2 GPUs:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   37C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            Off  | 00000000:00:05.0 Off |                    0 |
| N/A   43C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I think this is the part of the code that runs: accelerate/launchers.py at v0.15.0 · huggingface/accelerate · GitHub

muellerzr · December 8, 2022, 1:07pm

What’s your version of Accelerate? Only the latest version (0.15.0) will launch in Kaggle successfully. What you pointed out there was the Google Colab check statement.

Indramal · December 8, 2022, 3:29pm

Thank you very much. Kaggle already had an older version of that library preinstalled, and my own installation had not taken effect. I have now updated it to 0.15.0 and it is working.
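For anyone hitting the same problem, here is a minimal sketch for checking which Accelerate version the notebook actually sees, using only the standard library. The helpers `parse_release` and `needs_upgrade` are hypothetical names made up for this example, and the 0.15.0 minimum is taken from the reply above.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_release(v: str) -> tuple:
    """Turn a version string such as '0.15.0' into (0, 15, 0).

    Hypothetical helper: only the leading digits of the first three
    dot-separated pieces are used, so '0.15.0.dev0' also gives (0, 15, 0).
    """
    numbers = []
    for piece in v.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        numbers.append(int(digits or 0))
    while len(numbers) < 3:
        numbers.append(0)
    return tuple(numbers)


def needs_upgrade(installed: str, minimum: str = "0.15.0") -> bool:
    """Return True if the installed release is older than the minimum."""
    return parse_release(installed) < parse_release(minimum)


try:
    installed = version("accelerate")
    print("accelerate", installed, "| needs upgrade:", needs_upgrade(installed))
except PackageNotFoundError:
    print("accelerate is not installed in this environment")
```

If the package is upgraded with pip after it has already been imported in the notebook, the kernel usually has to be restarted before the new version is picked up.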

Indramal · December 8, 2022, 3:50pm

Results:

Colab

epoch 0: 87.67
epoch 1: 89.31
epoch 2: 93.93
epoch 3: 96.97
epoch 4: 97.55
Total execution time = 354.609 sec

Kaggle

epoch 0: 91.68
epoch 1: 89.31
epoch 2: 94.97
epoch 3: 97.46
epoch 4: 97.51
Total execution time = 427.316 sec

Colab with 1 GPU is faster than Kaggle with 2 GPUs.

muellerzr · December 8, 2022, 4:18pm

You should make sure you’re actually setting up your benchmarks correctly by reading our docs on it, as it’s very easy to assume that running the same script does the same thing (spoiler: it does not!). See: Comparing performance between different device setups

Indramal · December 9, 2022, 3:02am

I am sorry to say that it still does not run faster on 2 GPUs, and I don’t know why.

This is what I use as a sample code - Launching Multi-GPU Training from a Jupyter Environment

These are what I changed according to Comparing performance between different device setups

Setting Seed

set_seed(42) - the same value is used for both the 1-GPU and the 2-GPU run.

Batch Sizes

In Colab - 128

def get_dataloaders(batch_size: int = 128):
-----

def training_loop(mixed_precision="fp16", seed: int = 42, batch_size: int = 128):
-----

args = ("fp16", 42, 128)
notebook_launcher(training_loop, args, num_processes=1)

In Kaggle - 64

def get_dataloaders(batch_size: int = 64):
--

def training_loop(mixed_precision="fp16", seed: int = 42, batch_size: int = 64):
---

args = ("fp16", 42, 64)
notebook_launcher(training_loop, args, num_processes=2)

Learning Rates

Both use the same code:

# Instantiate the optimizer
learning_rate = 3e-2 / 25
learning_rate *= accelerator.num_processes
optimizer = torch.optim.Adam(params=model.parameters(), lr=learning_rate)
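To make the arithmetic behind these settings explicit, here is a small sketch. It assumes that the batch size passed to the dataloaders is per process, so the effective batch size is the per-process value times the number of processes. The function names `effective_batch_size` and `scaled_learning_rate` are hypothetical and only mirror the snippet above.

```python
def effective_batch_size(per_process_batch_size: int, num_processes: int) -> int:
    """Samples consumed per optimizer step across all processes
    (assumes each process draws its own batch of the given size)."""
    return per_process_batch_size * num_processes


def scaled_learning_rate(base_lr: float, num_processes: int) -> float:
    """Mirrors `learning_rate *= accelerator.num_processes` from the post."""
    return base_lr * num_processes


base_lr = 3e-2 / 25

# (name, per-process batch size, number of processes) as reported in the post
setups = [("Colab", 128, 1), ("Kaggle", 64, 2)]

for name, batch_size, num_processes in setups:
    print(
        name,
        "| effective batch size:", effective_batch_size(batch_size, num_processes),
        "| learning rate:", scaled_learning_rate(base_lr, num_processes),
    )
```

Under that assumption both runs see an effective batch size of 128, while the Kaggle run uses twice the learning rate (roughly 0.0024 versus 0.0012) because of the multiplication by the number of processes.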

Results:

In Kaggle (2 GPU):

Launching training on 2 GPUs.
epoch 0: 89.86
epoch 1: 87.50
epoch 2: 94.30
epoch 3: 96.78
epoch 4: 97.61
Total execution time = 460.321 sec

In Colab (1 GPU):

Launching training on one GPU.
epoch 0: 87.96
epoch 1: 87.36
epoch 2: 94.15
epoch 3: 97.16
epoch 4: 97.55
Total execution time = 341.572 sec

This is the code I use to calculate the time:

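import time

start_time = time.time()
# xx is a placeholder: the batch size and num_processes differ per system
args = ("fp16", 42, xx)
notebook_launcher(training_loop, args, num_processes=xx)
end_time = time.time()
print("Total execution time = {:.3f} sec".format(end_time - start_time))

Here xx is a placeholder that changes depending on the system: batch size 128 with num_processes=1 on Colab, and batch size 64 with num_processes=2 on Kaggle.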
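Note that this timer wraps the whole notebook_launcher call, so the total also covers everything else that happens inside training_loop (for example starting the processes and building the dataloaders and the model), not only the training steps. A possible way to separate these is to time each phase individually. The sketch below uses a hypothetical `timed` helper, and `time.sleep` stands in for one real epoch.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label: str, results: dict):
    """Hypothetical helper: store the wall-clock duration of a block
    under `label` in the `results` dictionary."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}

# Inside a training loop, each epoch (or the setup phase) could be wrapped
# like this; time.sleep is only a stand-in for the real work.
with timed("epoch 0", timings):
    time.sleep(0.01)

for label, seconds in timings.items():
    print("{}: {:.3f} sec".format(label, seconds))
```

In a multi-process run every process would record its own timings, so it may be enough to print them from the main process only.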

Indramal · December 10, 2022, 3:02am

@sgugger @patrickvonplaten any suggestions?
