Tiny fine-tune messed up the pretrained SD v1-4 model

I'm new to diffusers and am following the instructions in diffusers/examples/text_to_image at main · huggingface/diffusers · GitHub

I made some small changes to the training script and the inference code.

training

export MODEL_NAME="CompVis/stable-diffusion-v1-4"

export dataset_name="lambdalabs/pokemon-blip-captions"
# using CPU
accelerate launch --mixed_precision="no"  train_text_to_image.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$dataset_name \
  --use_ema \
  --resolution=512 --center_crop --random_flip \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --max_train_steps=10 \
  --learning_rate=1e-03 \
  --max_grad_norm=1 \
  --lr_scheduler="constant" --lr_warmup_steps=0 \
  --output_dir="sd-pokemon-model" 

inference

from diffusers import StableDiffusionPipeline

model_path = "sd-pokemon-model"  # the --output_dir from the training run above
pipe = StableDiffusionPipeline.from_pretrained(model_path, safety_checker=None, requires_safety_checker=False)
pipe = pipe.to("cpu")
# Recommended if your computer has < 64 GB of RAM
pipe.enable_attention_slicing()
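
A quick way to sanity-check the pipeline after loading, a minimal sketch (the prompt and step count here are just examples):

# Generate one image to eyeball the fine-tuned model.
image = pipe("a warrior on horse", num_inference_steps=30).images[0]
image.save("sample.png")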

However, even this short 10-step training run made the model totally messy; inference just returns pure noise.

Question: with the pokemon data, even only 10 training steps made the results totally messy. Is anything here incompatible with fine-tuning?

BTW, when I change it to 0 steps, everything works fine: inference returns results as good as the pretrained model, so the model loading and saving parts are OK.
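
To separate a save/load problem from a training problem, one could also compare the UNet weights before and after fine-tuning. A rough diagnostic sketch, assuming the fine-tuned model was saved to sd-pokemon-model as above:

import torch
from diffusers import UNet2DConditionModel

# Load the UNet from both checkpoints and see how far the weights moved.
base = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet")
tuned = UNet2DConditionModel.from_pretrained("sd-pokemon-model", subfolder="unet")

with torch.no_grad():
    max_delta = max(
        (p1 - p0).abs().max().item()
        for (_, p0), (_, p1) in zip(base.named_parameters(), tuned.named_parameters())
    )
print("largest per-weight change:", max_delta)  # a huge value would hint the updates blew up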

thanks!

Could you provide the prompts you used for the comparison and the images you got from those prompts?

0 training steps means there was no fine-tuning at all. It may simply be that the pre-trained model generated a plausible image with respect to the input prompt you gave it.

I'm using random prompts, e.g. "yoda", "a warrior on horse", etc.
The issue here is that, after loading the pretrained model, just a few training steps (even with a small learning rate on a small dataset) mess up the entire model, which looks unexpected. I suspect some config or format issue.
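
For a fair side-by-side, I can generate from both checkpoints with the same seed. A rough sketch (paths are the ones from my commands above):

import torch
from diffusers import StableDiffusionPipeline

prompt = "a warrior on horse"
for path in ("CompVis/stable-diffusion-v1-4", "sd-pokemon-model"):
    pipe = StableDiffusionPipeline.from_pretrained(
        path, safety_checker=None, requires_safety_checker=False
    ).to("cpu")
    generator = torch.Generator("cpu").manual_seed(0)  # same seed for both runs
    image = pipe(prompt, generator=generator, num_inference_steps=30).images[0]
    image.save(path.split("/")[-1] + ".png")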