Tiny fine-tune messed up the pretrained SD v1-4 model

I'm new to diffusers and am following the instructions in diffusers/examples/text_to_image at main · huggingface/diffusers · GitHub

I made some small changes to the training script and to inference:


export MODEL_NAME='CompVis/stable-diffusion-v1-4'

export dataset_name="lambdalabs/pokemon-blip-captions"
# using CPU
accelerate launch --mixed_precision="no"  train_text_to_image.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$dataset_name \
  --use_ema \
  --resolution=512 --center_crop --random_flip \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --max_train_steps=10 \
  --learning_rate=1e-03 \
  --max_grad_norm=1 \
  --lr_scheduler="constant" --lr_warmup_steps=0
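
A rough sketch of what the flags in the launch command above add up to, in case it helps when reading the run. The values are copied from the post; `num_processes = 1` is an assumption (a single CPU process), and `readme_learning_rate = 1e-5` is my recollection of the learning rate in the example README command, so please verify it against the README. It also assumes that `--max_train_steps` counts optimizer updates rather than individual batches, which is my understanding of the example script.

```python
# Values copied from the launch command in the post.
train_batch_size = 1
gradient_accumulation_steps = 4
max_train_steps = 10
learning_rate = 1e-3

# Assumption: a single process, since the run is on CPU.
num_processes = 1

# Assumption: the learning rate in the example README command (please verify).
readme_learning_rate = 1e-5

# Samples that contribute to each optimizer update.
effective_batch_size = train_batch_size * gradient_accumulation_steps * num_processes

# Total samples seen, assuming max_train_steps counts optimizer updates.
samples_seen = effective_batch_size * max_train_steps

# How much larger the learning rate is than the assumed README value.
lr_ratio = learning_rate / readme_learning_rate

print(f"effective batch size: {effective_batch_size}")
print(f"samples seen in {max_train_steps} updates: {samples_seen}")
print(f"learning rate vs. assumed README value: {lr_ratio:.0f}x")
```

Under these assumptions the run sees only a few dozen images, so the amount of data is unlikely to be the problem; the learning rate is the flag that differs most from the assumed README value.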


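from diffusers import StableDiffusionPipeline

# model_path is the directory that holds the saved (fine-tuned) pipeline
pipe = StableDiffusionPipeline.from_pretrained(model_path, safety_checker=None, requires_safety_checker=False)
pipe = pipe.to("cpu")
# Recommended if your computer has < 64 GB of RAM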

However, continuing training for just 10 steps completely broke the model: inference now returns nothing but pure noise.

Question: with the Pokemon data, even only 10 training steps leave the model completely broken. Is something in this setup incompatible with fine-tuning?

By the way, when I change it to 0 steps everything works fine: inference returns good results, the same as the pretrained model, so the model loading and saving parts are working.


Could you provide the prompts you used for the comparison and the images you got from those prompts?

0 training steps means that there was no fine-tuning. It may simply be that the pre-trained model generated a plausible image for the input prompt you gave.

I'm using random prompts, e.g. "yoda", "a warrior on horse", etc.
The issue here is that, after loading the pretrained model, just a few steps (even with a small learning rate on a small dataset) mess up the entire model, which looks unexpected. I suspect a config or format issue.
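
One thing that may be worth ruling out is the step size itself, since the launch command uses `--learning_rate=1e-03`. The sketch below is a toy, not the training script: it applies a plain Adam update (no weight decay) to a single scalar weight with a constant, hypothetical gradient of `0.01`, and compares how far the weight drifts in 10 updates at `1e-3` versus `1e-5`. It assumes the example script uses an Adam-family optimizer (AdamW, as far as I know); `adam_drift` and its arguments are names made up for illustration.

```python
import math


def adam_drift(lr, grad, steps, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return how far one scalar weight moves after `steps` Adam updates
    when it receives the same gradient `grad` at every step."""
    w = 0.0
    m = 0.0  # running mean of gradients
    v = 0.0  # running mean of squared gradients
    for t in range(1, steps + 1):
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad * grad
        m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return abs(w)


# Hypothetical gradient value; Adam normalizes most of its scale away.
drift_post = adam_drift(lr=1e-3, grad=0.01, steps=10)
drift_small = adam_drift(lr=1e-5, grad=0.01, steps=10)

print(f"drift after 10 updates at lr=1e-3: {drift_post:.6f}")
print(f"drift after 10 updates at lr=1e-5: {drift_small:.8f}")
print(f"ratio: {drift_post / drift_small:.1f}")
```

In this toy each update moves the weight by roughly the learning rate, so the total drift scales with `lr * steps`. Because Adam divides by the gradient magnitude, gradient clipping via `--max_grad_norm` would not be expected to shrink these steps much. Whether this explains the noise here is only a guess; rerunning the same 10 steps with a much smaller learning rate would confirm or rule it out.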