Diffusers text-to-image finetuning example fails on multi-node

I'm trying to finetune Stable Diffusion 1.4/1.5 on a custom dataset based on diffusers' example - diffusers/train_text_to_image.py at main · huggingface/diffusers · GitHub
Following the original script, I used Accelerate to train on multiple GPUs.

When I train it locally (2 GPUs, pink curve), it converges after around 15K steps. When I train it on multiple nodes (4 nodes x 8 GPUs, gray curve), it doesn't converge. All hyperparameters except the batch size and number of steps are the same. Do you have any idea why the fine-tuning script fails on multi-node? I suspect the EMA weights might need explicit synchronization between nodes, but I'm not an expert on this.
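For anyone else debugging this: if each rank keeps its own EMA copy, the copies can drift whenever the ranks' model weights diverge. A minimal sketch of the idea below, assuming a standard EMA update and `torch.distributed` already initialized; `ema_update` and `sync_ema` are illustrative names, not the diffusers API.

```python
import torch
import torch.distributed as dist


def ema_update(ema_params, model_params, decay=0.9999):
    """Standard EMA update: ema <- decay * ema + (1 - decay) * param."""
    with torch.no_grad():
        for e, p in zip(ema_params, model_params):
            e.mul_(decay).add_(p, alpha=1 - decay)


def sync_ema(ema_params):
    """Broadcast rank 0's EMA weights so every node holds the same copy.

    Call this periodically (or before saving a checkpoint) in a
    multi-node run; it is a no-op fix if the copies never diverged.
    """
    for e in ema_params:
        dist.broadcast(e, src=0)
```

If the model gradients are all-reduced every step (as DDP does), the EMA copies should stay identical in exact arithmetic, so drift here would point to something else (e.g. non-determinism or a rank-dependent code path).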

Thanks in advance for any help/advice!

Hi @j-min! Would you mind opening an issue in the diffusers repo so the team can look into this?

Thanks a lot!

I've been running on an 8x A100 Google Cloud machine and would like to bump it up to 16 GPUs using the same script (ideally). Could you point to the issue you opened, or share any script changes and/or Accelerate config settings you used? Any help or guidance would be much appreciated.
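In case it helps while waiting for an answer: for multi-node Accelerate runs, each machine needs a config that shares the same `main_process_ip`/`main_process_port` but has a distinct `machine_rank`. A sketch of what `accelerate config` might generate for 2 machines x 8 GPUs (all values below are placeholders, not a verified working config):

```yaml
# Illustrative multi-node config (e.g. default_config.yaml); values are placeholders.
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
num_machines: 2
num_processes: 16        # total GPUs across all machines
machine_rank: 0          # set to 1 on the second machine
main_process_ip: 10.0.0.1
main_process_port: 29500
mixed_precision: fp16
```

You would then run `accelerate launch train_text_to_image.py ...` on every machine with its own `machine_rank`.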