Diffusers text-to-image finetuning example fails on multi-node

I'm trying to fine-tune Stable Diffusion 1.4/1.5 on a custom dataset based on diffusers' example: diffusers/train_text_to_image.py at main · huggingface/diffusers · GitHub
Following the original script, I used accelerate to launch training on multiple GPUs.
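For reference, the multi-node run was launched roughly like this (the machine count, rank, IP, port, and script arguments are placeholders for my actual setup, not values from the original script):

```shell
# Run once per node, with --machine_rank set to 0..3 on the four nodes.
accelerate launch \
  --multi_gpu \
  --num_machines 4 \
  --machine_rank 0 \
  --main_process_ip 10.0.0.1 \
  --main_process_port 29500 \
  --num_processes 32 \
  train_text_to_image.py \
  --pretrained_model_name_or_path CompVis/stable-diffusion-v1-4 \
  --use_ema
```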

When I train locally (2 GPUs - pink curve), it converges after around 15K epochs. When I train on multiple nodes (4 nodes x 8 GPUs - gray curve), it doesn't converge. All hyperparameters except the batch size and number of steps are the same. Do you have any idea why the fine-tuning script fails on multi-node? I suspect the EMA weights might need an explicit sync between nodes, but I'm not an expert on this.
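In case it helps, here is a rough sketch of what I mean by an explicit EMA sync. `SimpleEMA` is a hypothetical stand-in I wrote for illustration, not the actual EMA helper in the diffusers script; the broadcast call is the part I'm wondering about:

```python
import torch
import torch.distributed as dist

class SimpleEMA:
    """Minimal EMA tracker with an explicit cross-rank sync (illustrative only)."""

    def __init__(self, params, decay=0.9999):
        self.decay = decay
        # Shadow copy of the parameters, updated each step.
        self.shadow = [p.detach().clone() for p in params]

    @torch.no_grad()
    def step(self, params):
        # shadow <- decay * shadow + (1 - decay) * param
        for s, p in zip(self.shadow, params):
            s.mul_(self.decay).add_(p, alpha=1 - self.decay)

    @torch.no_grad()
    def broadcast_from_rank0(self):
        # Force every rank to hold rank 0's EMA copy.
        # No-op in a single-process run.
        if dist.is_available() and dist.is_initialized():
            for s in self.shadow:
                dist.broadcast(s, src=0)
```

In principle, since DDP all-reduces gradients, every rank's EMA copy should evolve identically and this broadcast would be a no-op; calling it occasionally would at least rule out drift between nodes as the cause.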

Thanks in advance for any help/advice!

Hi @j-min! Would you mind opening an issue in the diffusers repo so the team can look into this?

Thanks a lot!