I’m trying to fine-tune Stable Diffusion 1.4/1.5 on a custom dataset based on the diffusers example script - diffusers/train_text_to_image.py at main · huggingface/diffusers · GitHub
Following the original script, I used accelerate for multi-GPU training.
When I train locally (2 GPUs - pink curve), it converges after around 15K steps. When I train on multiple nodes (4 nodes x 8 GPUs - gray curve), it doesn’t converge. All hyperparameters except the effective batch size and number of steps are the same. Do you have any idea why the fine-tuning script fails in the multi-node setting? I suspect that the EMA weights might need an explicit sync between nodes, but I’m not an expert on this.
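For context on the EMA suspicion: as I understand it, the EMA shadow weights are updated from the model weights *after* the optimizer step, and by that point DDP/accelerate has already all-reduced the gradients, so every rank applies the same update to its own EMA copy and they should stay identical without an explicit sync. A minimal sketch of the update rule I mean (the `EMAModel` class name and its methods here are my own illustration, not the diffusers API):

```python
import copy
import torch

class EMAModel:
    """Toy EMA tracker (hypothetical sketch, not the diffusers implementation).

    Keeps a frozen shadow copy of the model and blends in the live weights
    after each optimizer step: shadow = decay * shadow + (1 - decay) * param.
    """

    def __init__(self, model: torch.nn.Module, decay: float = 0.9999):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval()
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def step(self, model: torch.nn.Module) -> None:
        # Because DDP all-reduces gradients before the optimizer step,
        # `model`'s weights are identical on every rank here, so each
        # rank's shadow copy evolves identically - no cross-node sync needed.
        for s, p in zip(self.shadow.parameters(), model.parameters()):
            s.mul_(self.decay).add_(p, alpha=1.0 - self.decay)

# Tiny single-process demo of the arithmetic.
model = torch.nn.Linear(1, 1, bias=False)
with torch.no_grad():
    model.weight.fill_(1.0)
ema = EMAModel(model, decay=0.9)
with torch.no_grad():
    model.weight.fill_(2.0)  # pretend an optimizer step moved the weight
ema.step(model)
print(round(ema.shadow.weight.item(), 4))  # 0.9 * 1.0 + 0.1 * 2.0 = 1.1
```

If that reasoning holds, the divergence might instead come from the 4x larger effective batch size (e.g. a learning rate that needs rescaling), but I’d welcome corrections.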
Thanks in advance for any help/advice!