I followed the diffusers documentation to create a fine-tuned text-to-image LoRA model for a specific subject. I have images and captions of this subject doing various things; the dataset can be found here: fw1zr/rahul-gandhi-captions · Datasets at Hugging Face.
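In case it helps with reproducing, the dataset loads with the standard datasets API (I'm assuming the usual single train split here; the prints show the actual columns rather than me hard-coding them):

from datasets import load_dataset

# Load the image + caption dataset straight from the Hub
ds = load_dataset("fw1zr/rahul-gandhi-captions", split="train")
print(ds)     # number of rows and column names
print(ds[0])  # one sample, e.g. the image and its caption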
I used the diffusers text-to-image LoRA training guide on Stable-Diffusion-v1-5 and trained on a 16 GB GPU for over 7 hours, but when I run inference I find that the generated outputs are very distorted and low quality.
prompt: photo of rahul gandhi, smiling, beard look, wearing glasses, speaking, with one hand up
Here is the script I used for inference:
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

model_base = "runwayml/stable-diffusion-v1-5"

pipe = StableDiffusionPipeline.from_pretrained(
    model_base, torch_dtype=torch.float16, use_safetensors=True
)
# NB: DPMSolverMultistepScheduler is imported above but never swapped in,
# so the pipeline runs with the default scheduler.

# Load the trained LoRA attention processors into the UNet
pipe.unet.load_attn_procs("BootesVoid/rahul-gandhi-lora")
pipe.to("cuda")

generator = torch.Generator("cuda").manual_seed(17677)
image = pipe(
    "photo of rahul gandhi, walking",
    num_inference_steps=100,
    guidance_scale=7.5,
    generator=generator,
    cross_attention_kwargs={"scale": 0.7},  # LoRA weight scale
).images[0]
image  # display the result in a notebook
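One thing I'm unsure about: unet.load_attn_procs is the loader from the older docs. If the loading step is the problem, would the newer load_lora_weights API be the right way instead? A sketch of what I mean (untested, same repo id and prompt as above):

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    use_safetensors=True,
)
# Newer diffusers versions expose a pipeline-level LoRA loader
pipe.load_lora_weights("BootesVoid/rahul-gandhi-lora")
pipe.to("cuda")

image = pipe(
    "photo of rahul gandhi, walking",
    num_inference_steps=100,
    guidance_scale=7.5,
    cross_attention_kwargs={"scale": 0.7},  # LoRA strength, as before
).images[0]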
The model can be found here: BootesVoid/rahul-gandhi-lora · Hugging Face.
How do I get this model to produce high-quality, photorealistic output? Do I have to switch to SDXL for fine-tuning, or add some sort of upscaler to the pipeline? Or am I running inference incorrectly?
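For the upscaler idea, this is the kind of thing I had in mind: chaining the Stable Diffusion x4 upscaler after the LoRA pipeline. A rough sketch (untested; the prompt is reused from above, and memory might be tight on 16 GB with a 512px input, hence the attention slicing):

import torch
from diffusers import StableDiffusionUpscalePipeline

# The x4 upscaler takes a low-res PIL image plus a prompt
upscaler = StableDiffusionUpscalePipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler",
    torch_dtype=torch.float16,
).to("cuda")
upscaler.enable_attention_slicing()  # reduce peak VRAM usage

# `image` is the output of the LoRA pipeline above
upscaled = upscaler(
    prompt="photo of rahul gandhi, walking",
    image=image,
    num_inference_steps=20,
).images[0]
upscaled.save("upscaled.png")

But I suspect an upscaler only adds resolution and won't fix the distortion itself, which is why I'm asking whether the real problem is on the training side or the inference side.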