Notebook_launcher in diffusers_training_example.ipynb fails with num_processes>=2

hugbump · February 19, 2023, 6:24pm

Dear HF community:

I am trying to run diffusers_training_example.ipynb on a subset of the CelebA-HQ dataset. Specifically,

config.dataset_name = "huggan/CelebA-faces"
dataset = load_dataset(config.dataset_name, split="train")

dataset.set_transform(transform)

from torch.utils.data import Subset
dataset = Subset(dataset, range(5000))

...

# The only change: num_processes from 1 to 4
notebook_launcher(train_loop, args, num_processes=4)

And I got an error saying

RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method.

However, with num_processes=1 it works fine. I used Lambda Cloud GPU instances and tried both 8xA100 and 8xTesla V100, but got the same error on each. My PyTorch version is 1.12. I searched a bit and tried a few methods, but they didn't work. The GitHub issue "`notebook_launcher` fails with `num_processes>=2`" (Issue #182, huggingface/accelerate) seems similar. Do you know what is going on, and can you give me some pointers? Thank you very much!
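
A hedged pointer for readers who hit the same error, not a confirmed diagnosis for this notebook: `notebook_launcher` starts its worker processes by forking the notebook process, and a forked child cannot reuse a CUDA context that the parent already created. If an earlier cell touched the GPU (for example, by moving a model or tensor to `cuda`), the fork fails with the message above. The sketch below checks for that condition using `torch.cuda.is_initialized()`. The helper name `cuda_already_initialized` is hypothetical and is not part of Accelerate or PyTorch.

```python
import importlib.util


def cuda_already_initialized() -> bool:
    """Return True if this process has already initialized CUDA.

    Hypothetical helper for illustration only. If PyTorch is not
    installed, there is nothing to check, so it returns False.
    """
    if importlib.util.find_spec("torch") is None:
        return False
    import torch

    return torch.cuda.is_initialized()


# Run this in the cell just before notebook_launcher(...).
if cuda_already_initialized():
    print("CUDA is already initialized in this kernel; "
          "restart the kernel and avoid GPU calls before notebook_launcher.")
else:
    print("CUDA is not initialized yet; forking worker processes should be possible.")
```

If the check reports that CUDA is initialized, a common workaround is to restart the kernel and keep every GPU-touching step (model creation, `.to("cuda")`, building the `Accelerator`) inside the function passed to `notebook_launcher`, here `train_loop`.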

hugbump · February 19, 2023, 7:09pm

By the way, regarding training on multiple GPUs, I searched a bit and it seems Lambda Cloud GPU instances can do the job, contingent on this issue being solved.

As far as I know, Colab Pro and Pro+ provide only one GPU, although a good one. Amazon SageMaker seems pretty complicated, its documentation doesn't seem adequate, and its price isn't competitive with Lambda Cloud. If you have any tips on where to access multiple GPUs, please let me know. For now I just want to test the waters on the cloud. Thank you very much.
