I'm new to diffusers and am following the instructions in diffusers/examples/text_to_image on the main branch of huggingface/diffusers on GitHub.
I made some small changes to the training command and the inference code:
export MODEL_NAME='CompVis/stable-diffusion-v1-4'
export dataset_name="lambdalabs/pokemon-blip-captions"

# using CPU
accelerate launch --mixed_precision="no" train_text_to_image.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$dataset_name \
  --use_ema \
  --resolution=512 --center_crop --random_flip \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --max_train_steps=10 \
  --learning_rate=1e-03 \
  --max_grad_norm=1 \
  --lr_scheduler="constant" --lr_warmup_steps=0 \
  --output_dir="sd-pokemon-model"
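For reference, here is a rough sketch of how much data the run above actually sees. It assumes a single CPU process and that each training step counts one optimizer update after gradient accumulation; the helper names are made up for illustration and are not part of the training script.

```python
# Hypothetical helpers (not part of train_text_to_image.py) that work out
# the effective batch size and the number of samples consumed by a run.

def effective_batch_size(train_batch_size, gradient_accumulation_steps, num_processes=1):
    """Samples contributing to one optimizer update."""
    return train_batch_size * gradient_accumulation_steps * num_processes


def samples_seen(max_train_steps, train_batch_size, gradient_accumulation_steps, num_processes=1):
    """Total samples consumed over the whole run."""
    per_update = effective_batch_size(
        train_batch_size, gradient_accumulation_steps, num_processes
    )
    return max_train_steps * per_update


# Values taken from the command above; num_processes=1 is an assumption.
print(effective_batch_size(train_batch_size=1, gradient_accumulation_steps=4))
print(samples_seen(max_train_steps=10, train_batch_size=1, gradient_accumulation_steps=4))
```

Under those assumptions, the 10 steps correspond to 10 optimizer updates over 40 images in total.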
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(model_path, safety_checker=None, requires_safety_checker=False)
pipe = pipe.to("cpu")
# Recommended if your computer has < 64 GB of RAM
pipe.enable_attention_slicing()
However, just 10 consecutive training steps completely broke the model: inference now returns nothing but pure noise.
Question: with the Pokemon data, even only 10 training steps leave the results totally messed up. Is something in my setup incompatible with fine-tuning?
By the way, when I change this to 0 steps, everything works fine: inference returns the same good results as the pretrained model, so the model loading and saving parts are fine.