Text-to-image training loss becomes nan all of a sudden

Description

Hello,

I am trying to fine-tune the Stable Diffusion 2.1 model on a custom dataset, using the example script provided here. Training starts fine and the loss decreases, but then, at a random point, the loss becomes NaN and the model starts to output black images. I don't think it is an issue with the dataset, as this once happened only after an entire epoch (i.e. after it had iterated over all samples). The point of failure is completely random: I once got it to train for 1800 steps before running into the problem, while on average it takes around 200 steps.
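In case it helps with debugging, here is a minimal NaN check (my own addition, not part of the example script) that can be called right after the loss is computed inside the training loop, so the failing step is caught immediately:

```python
import torch

def check_finite(loss, model, step):
    """Debugging helper: stop on a non-finite loss and report which
    parameters (if any) already contain NaN/Inf values."""
    if torch.isfinite(loss):
        return
    bad = [name for name, p in model.named_parameters()
           if not torch.isfinite(p).all()]
    raise RuntimeError(
        f"Non-finite loss {loss.item()} at step {step}; "
        f"parameters with NaN/Inf: {bad[:10]}"
    )
```

In the example script this would be called as something like `check_finite(loss, unet, global_step)`, using names the script already defines.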

I have tried two different datasets. One was a custom dataset of 768x786 images with 2000 samples. The other was this fashion dataset from Kaggle, from which I used a subset of around 1800 images.

I have tried batch sizes up to 4, and I have tried training both with and without the xFormers flag.

Any ideas what the problem could be?
Thanks.

The command I am using (the training script is unchanged):

```
accelerate launch train_text_to_image.py --pretrained_model_name_or_path="stabilityai/stable-diffusion-2-1" --train_data_dir="dataset\dataset_fashion" --resolution=512 --center_crop --train_batch_size=1 --gradient_accumulation_steps=1 --gradient_checkpointing --max_train_steps=10000 --learning_rate=1e-6 --max_grad_norm=0.5 --lr_scheduler="constant" --output_dir="output" --checkpointing_steps=200 --enable_xformers_memory_efficient_attention
```
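Since this run saves a checkpoint every 200 steps, one way to narrow things down is to scan the saved checkpoints and find the first one whose weights already contain NaN/Inf. A rough sketch, assuming each checkpoint folder holds a plain PyTorch state dict (the exact filename depends on the Accelerate version and settings):

```python
import torch

def scan_state_dict(path):
    """Report tensors in a checkpoint file that contain NaN/Inf values."""
    state_dict = torch.load(path, map_location="cpu")
    bad = [k for k, v in state_dict.items()
           if torch.is_tensor(v) and v.is_floating_point()
           and not torch.isfinite(v).all()]
    if bad:
        print(f"{path}: {len(bad)} tensors with NaN/Inf, e.g. {bad[:5]}")
    else:
        print(f"{path}: all tensors finite")

# Hypothetical paths -- adjust to the actual checkpoint layout on disk.
for step in range(200, 2001, 200):
    scan_state_dict(f"output/checkpoint-{step}/pytorch_model.bin")
```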

System Info

  • GPU: 24GB Titan RTX
  • diffusers version: 0.15.0.dev0
  • Platform: Windows-10-10.0.19044-SP0
  • Python version: 3.9.16
  • PyTorch version (GPU?): 1.13.1+cu116 (True)
  • Huggingface_hub version: 0.13.3
  • Transformers version: 4.27.3
  • Accelerate version: 0.18.0
  • xFormers version: 0.0.16
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

I've seen this a couple of times. In my case I was able to get around it by lowering the learning rate, but yours is already quite low, so I'm not too sure, unfortunately.

What happens if you up the grad norm to 1.0?

Same results. I was initially trying it at 1.0. I later changed it to 0.5.
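Something that might help pin this down is logging the total gradient norm right before the clipping call, to see whether the gradients blow up a few steps before the loss turns NaN. A small helper sketch (my own code, not part of the script):

```python
import torch

def grad_norm(model):
    """Total L2 norm of the gradients currently stored on the model's
    parameters; useful for spotting a blow-up before the loss turns NaN."""
    norms = [p.grad.detach().norm(2) for p in model.parameters()
             if p.grad is not None]
    if not norms:
        return 0.0
    return torch.norm(torch.stack(norms), 2).item()
```

Logging `grad_norm(unet)` every step would also show whether the clip at 0.5 (or 1.0) is actually being hit in the steps leading up to the failure.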

Did anyone find out what causes the problem?

Thanks for sharing.
Peter

Hi, did you figure out what the problem was later? Is it related to FP16?
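In case it helps with the FP16 question: a quick way to check whether half precision alone produces non-finite values is to run a single VAE forward pass in float16 and float32 and compare. This is only a sketch; the model id, random input, and dtypes are placeholders for whatever you actually train with:

```python
import torch
from diffusers import AutoencoderKL

model_id = "stabilityai/stable-diffusion-2-1"
x = torch.randn(1, 3, 512, 512)  # stand-in for a normalized training image

for dtype in (torch.float32, torch.float16):
    vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae",
                                        torch_dtype=dtype).to("cuda")
    with torch.no_grad():
        latents = vae.encode(x.to("cuda", dtype)).latent_dist.sample()
    print(dtype, "finite:", torch.isfinite(latents).all().item())
```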