How do I increase the quality of a fine-tuned text-to-image LoRA?

I followed the diffusers documentation to create a fine-tuned text-to-image LoRA model for a specific subject. I have images and captions of this subject doing various things. The dataset can be found here: fw1zr/rahul-gandhi-captions · Datasets at Hugging Face.

I followed the diffusers docs for training a text-to-image LoRA on Stable-Diffusion-v1-5 and trained on a 16 GB GPU for over 7 hours, but after running inference I find that the generated outputs are very distorted and low quality.
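For context, I launched training with a command along these lines, using the `train_text_to_image_lora.py` script from the diffusers text-to-image examples (the exact hyperparameter values below are approximate / illustrative, not necessarily what I ended up with):

```shell
# Adapted from the diffusers text-to-image LoRA example;
# hyperparameter values here are illustrative, not exact.
accelerate launch train_text_to_image_lora.py \
  --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
  --dataset_name="fw1zr/rahul-gandhi-captions" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --learning_rate=1e-4 \
  --lr_scheduler="cosine" \
  --mixed_precision="fp16" \
  --seed=42 \
  --output_dir="rahul-gandhi-lora"
```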

prompt: photo of rahul gandhi, smiling, beard look, wearing glasses, speaking, with one hand up

Here is the script I used for inference:

import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

model_base = "runwayml/stable-diffusion-v1-5"
pipe = StableDiffusionPipeline.from_pretrained(
    model_base, torch_dtype=torch.float16, use_safetensors=True
)

# Swap in the DPM-Solver multistep scheduler
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

# Load the fine-tuned LoRA weights on top of the base model
pipe.load_lora_weights("BootesVoid/rahul-gandhi-lora")
pipe.to("cuda")

generator = torch.Generator("cuda").manual_seed(17677)
image = pipe(
    "photo of rahul gandhi, walking",
    generator=generator,
).images[0]
image.save("output.png")
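One thing I'm unsure about is the LoRA strength. If I'm reading the diffusers docs correctly, the adapter's influence can be scaled down at inference via `cross_attention_kwargs` (1.0 being full strength), something like the fragment below (reusing `pipe` and `generator` from the script above; the 0.7 value is just a guess on my part):

```python
# Assumes `pipe` already has the LoRA loaded, as in the script above.
# scale < 1.0 blends the adapter's effect with the base model.
image = pipe(
    "photo of rahul gandhi, walking",
    generator=generator,
    cross_attention_kwargs={"scale": 0.7},  # try values between 0.5 and 1.0
).images[0]
```

Should I be tuning this scale, or is full strength the right default?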
The model can be found here: BootesVoid/rahul-gandhi-lora · Hugging Face.

How do I get this model to produce high-quality, photorealistic output? Do I have to switch to SDXL for fine-tuning, or add some sort of upscaler to the pipeline? Or am I running inference incorrectly?