Diffusers text-to-image finetuning example fails on multi-node

I'm trying to fine-tune Stable Diffusion 1.4/1.5 on a custom dataset based on diffusers' example: diffusers/train_text_to_image.py at main · huggingface/diffusers · GitHub
Following the original script, I used accelerate to launch training on multiple GPUs.
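For reference, the multi-node run was launched roughly like this (the machine count, rank, IP, port, and script arguments are placeholders for my actual setup, not values from the original script):

```shell
# Run once per node, with --machine_rank set to 0..3 on the four nodes.
accelerate launch \
  --multi_gpu \
  --num_machines 4 \
  --machine_rank 0 \
  --main_process_ip 10.0.0.1 \
  --main_process_port 29500 \
  --num_processes 32 \
  train_text_to_image.py \
  --pretrained_model_name_or_path CompVis/stable-diffusion-v1-4 \
  --use_ema
```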

When I train locally (2 GPUs - pink curve), it converges after around 15K epochs. When I train on multiple nodes (4 nodes x 8 GPUs - gray curve), it doesn't converge. All hyperparameters except the batch size and number of steps are the same. Do you have any idea why the fine-tuning script fails on multi-node? I suspect the EMA weights might need an explicit sync between nodes, but I'm not an expert on this.
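In case it helps, here is a rough sketch of what I mean by an explicit EMA sync. `SimpleEMA` is a hypothetical stand-in I wrote for illustration, not the actual EMA helper in the diffusers script; the broadcast call is the part I'm wondering about:

```python
import torch
import torch.distributed as dist

class SimpleEMA:
    """Minimal EMA tracker with an explicit cross-rank sync (illustrative only)."""

    def __init__(self, params, decay=0.9999):
        self.decay = decay
        # Shadow copy of the parameters, updated each step.
        self.shadow = [p.detach().clone() for p in params]

    @torch.no_grad()
    def step(self, params):
        # shadow <- decay * shadow + (1 - decay) * param
        for s, p in zip(self.shadow, params):
            s.mul_(self.decay).add_(p, alpha=1 - self.decay)

    @torch.no_grad()
    def broadcast_from_rank0(self):
        # Force every rank to hold rank 0's EMA copy.
        # No-op in a single-process run.
        if dist.is_available() and dist.is_initialized():
            for s in self.shadow:
                dist.broadcast(s, src=0)
```

In principle, since DDP all-reduces gradients, every rank's EMA copy should evolve identically and this broadcast would be a no-op; calling it occasionally would at least rule out drift between nodes as the cause.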

Thanks in advance for any help/advice!

Hi @j-min! Would you mind opening an issue in the diffusers repo so the team can look into this?

Thanks a lot!