Stable Diffusion `train_text_to_image.py` runs on only one GPU

jbmaxwell · February 10, 2023, 6:38pm

I’m training a Stable Diffusion model using a modified version of the train_text_to_image.py script. I’m noticing that it’s only running on one of my two GPUs. Is there a simple fix for this? Is it possible to run this script on multiple GPUs? (It seems like it should be, since there are references to both “parallel” and “distributed” in the code…)

How do I enable multi-GPU training? Or, how might I approach debugging why it’s not working?

EDIT: I see that Accelerate is supposed to handle this dynamically. So I guess I’m curious where to start looking for the problem.

jbmaxwell · February 11, 2023, 12:24am

I’m seeing references to accelerator.num_processes, but I can’t set it… ?? Can I only set it from a config file?

More info…

config:

compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
gpu_ids: all
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
use_cpu: false

At launch I’m seeing this:

02/10/2023 18:46:44 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Mixed precision type: fp16

Is there something I have to set to use the default_config.yaml file?

accelerate test seems okay:

stdout: Distributed environment: MULTI_GPU  Backend: nccl
stdout: Num processes: 2
stdout: Process index: 1
stdout: Local process index: 1
stdout: Device: cuda:1
stdout: Mixed precision type: fp16
stdout: 
stdout: **Initialization**
stdout: Testing, testing. 1, 2, 3.
stdout: Distributed environment: MULTI_GPU  Backend: nccl
stdout: Num processes: 2
stdout: Process index: 0
stdout: Local process index: 0
stdout: Device: cuda:0
stdout: Mixed precision type: fp16
stdout: 
stdout: 
stdout: **Test random number generator synchronization**
stdout: All rng are properly synched.
stdout: 
stdout: **DataLoader integration test**
stdout: 0 1 tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
stdout:         18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
stdout:         36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53,
stdout:         54, 55, 56, 57, 58, 59, 60, 61, 62, 63], device='cuda:0') <class 'accelerate.data_loader.DataLoaderShard'>
stdout: tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
stdout:         18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
stdout:         36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53,
stdout:         54, 55, 56, 57, 58, 59, 60, 61, 62, 63], device='cuda:1') <class 'accelerate.data_loader.DataLoaderShard'>
stdout: Non-shuffled dataloader passing.
stdout: Shuffled dataloader passing.
stdout: Non-shuffled central dataloader passing.
stdout: Shuffled central dataloader passing.
stdout: 
stdout: **Training integration test**
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Training yielded the same results on one CPU or distributed setup with no batch split.
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: FP16 training check.
stdout: Training yielded the same results on one CPU or distributes setup with batch split.
stdout: FP16 training check.
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: BF16 training check.
stdout: BF16 training check.
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
Test is a success! You are ready for your distributed training!

jbmaxwell · February 11, 2023, 1:10am

oh… my… goodness…

not
python /my/bloody/script.py
but
accelerate launch /my/bloody/script.py

… yikes…

alexchenyu · April 28, 2023, 5:31pm

Hi, did you figure it out? Can you run the script on multiple GPUs now?

jbmaxwell · April 28, 2023, 8:18pm

Yeah, you have to run `accelerate config` first, but once you’ve done that, just call the script with `accelerate launch` rather than `python`. Then Accelerate does all the heavy lifting for you!
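
A quick way to tell which of the two happened is to look at the environment variables a distributed launcher exports before the script starts. The sketch below assumes the launcher behaves like torchrun and sets WORLD_SIZE, RANK, and LOCAL_RANK for every worker. The helper name launch_info is made up for illustration and is not part of Accelerate.

```python
import os
from typing import Mapping, Optional


def launch_info(env: Optional[Mapping[str, str]] = None) -> dict:
    """Summarise how this process was started, based on launcher env vars.

    Assumption: the launcher exports WORLD_SIZE, RANK and LOCAL_RANK
    (torchrun-style). A plain `python script.py` run sets none of them.
    """
    env = os.environ if env is None else env
    world_size = int(env.get("WORLD_SIZE", "1"))
    return {
        # Only treat the run as distributed if a launcher set LOCAL_RANK
        # and more than one process is expected.
        "distributed": "LOCAL_RANK" in env and world_size > 1,
        "num_processes": world_size,
        "process_index": int(env.get("RANK", "0")),
        "local_process_index": int(env.get("LOCAL_RANK", "0")),
    }
```

Calling print(launch_info()) near the top of a training script should show distributed as False with one process when the script is started with plain python, which would match the “Distributed environment: NO” log above.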

williamberman · May 2, 2023, 8:50pm

Hey @jbmaxwell, yes, that is correct, sorry for the confusion. For context, a Python script that runs on multiple GPUs is usually launched differently from a standard script, because multiple processes have to be spawned and orchestrated.

Here are some related docs, in case they’re helpful:

https://pytorch.org/docs/stable/elastic/run.html
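
To make “spawned and orchestrated” concrete, here is a toy sketch of what a launcher does at its simplest: start one process per rank and tell each one who it is through environment variables. This is not how accelerate launch or torchrun is implemented; spawn_workers and WORKER_CODE are hypothetical names, and a real launcher also handles rendezvous between processes, device assignment, and failures.

```python
import os
import subprocess
import sys

# Code each worker runs: report its identity as read from the environment.
WORKER_CODE = (
    "import os; "
    "print(os.environ['RANK'], os.environ['WORLD_SIZE'])"
)


def spawn_workers(num_processes: int) -> list:
    """Start one Python process per rank and collect what each prints."""
    procs = []
    for rank in range(num_processes):
        # Each worker gets a copy of the parent environment plus its own rank.
        env = dict(
            os.environ,
            RANK=str(rank),
            LOCAL_RANK=str(rank),
            WORLD_SIZE=str(num_processes),
        )
        procs.append(
            subprocess.Popen(
                [sys.executable, "-c", WORKER_CODE],
                env=env,
                stdout=subprocess.PIPE,
                text=True,
            )
        )
    # Wait for every worker to finish, in rank order.
    outputs = []
    for proc in procs:
        out, _ = proc.communicate()
        outputs.append(out.strip())
    return outputs
```

Running the same script with plain python skips this step entirely, so only one process exists and only one GPU is used.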
