I'm new to diffusers and am following the instructions in examples/text_to_image from the huggingface/diffusers repo on GitHub (main branch). I made some small changes to the training script and the inference code.
Training:
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export dataset_name="lambdalabs/pokemon-blip-captions"
# using CPU
accelerate launch --mixed_precision="no" train_text_to_image.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--dataset_name=$dataset_name \
--use_ema \
--resolution=512 --center_crop --random_flip \
--train_batch_size=1 \
--gradient_accumulation_steps=4 \
--gradient_checkpointing \
--max_train_steps=10 \
--learning_rate=1e-03 \
--max_grad_norm=1 \
--lr_scheduler="constant" --lr_warmup_steps=0 \
--output_dir="sd-pokemon-model"
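Training runs through all 10 steps without errors, so the dataset itself loads fine; here is a minimal standalone check (just a sketch; I'm assuming the standard "train" split and the image/text columns):

from datasets import load_dataset

# sanity-check the dataset used by the training command above
ds = load_dataset("lambdalabs/pokemon-blip-captions", split="train")  # split name assumed
print(len(ds))                      # number of image/caption pairs
print(ds[0]["text"])                # BLIP caption for the first image
ds[0]["image"].save("sample.png")   # the paired Pokemon image (PIL)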
Inference:
from diffusers import StableDiffusionPipeline

model_path = "sd-pokemon-model"  # the --output_dir from the training run above
pipe = StableDiffusionPipeline.from_pretrained(model_path, safety_checker=None, requires_safety_checker=False)
pipe = pipe.to("cpu")
# Recommended if your computer has < 64 GB of RAM
pipe.enable_attention_slicing()
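The generation call itself looks like this (a sketch; the prompt, seed, and filename are arbitrary placeholders, and the fixed generator just makes runs comparable):

import torch

# fixed seed so the fine-tuned and pretrained outputs can be compared
# on identical noise; prompt and filename are placeholders
generator = torch.Generator(device="cpu").manual_seed(0)
image = pipe(prompt="yoda", generator=generator, num_inference_steps=50).images[0]
image.save("finetuned-sample.png")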
However, even this short 10-step training run left the model totally messy: inference returned nothing but pure noise.

Question: with the pokemon data, even only 10 training steps make the result totally messy. Is anything in this setup incompatible with fine-tuning?
BTW, when I set max_train_steps to 0, everything works fine: inference returns results as good as the pretrained model, so the model loading and saving parts are correct.
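For that comparison, the baseline is the pretrained pipeline loaded with the same settings (again a sketch; same placeholder prompt and seed as above so the outputs line up):

from diffusers import StableDiffusionPipeline
import torch

# baseline: the original pretrained weights, loaded and sampled exactly
# like the fine-tuned checkpoint above
baseline = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", safety_checker=None, requires_safety_checker=False
).to("cpu")
baseline.enable_attention_slicing()
generator = torch.Generator(device="cpu").manual_seed(0)
baseline(prompt="yoda", generator=generator).images[0].save("pretrained-sample.png")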
thanks!