How to increase the quality of a fine-tuned text-to-image LoRA?

I followed the diffusers documentation to create a fine-tuned text-to-image LoRA model for a certain subject. I have images and captions of this subject doing various things; the dataset can be found here: fw1zr/rahul-gandhi-captions · Datasets at Hugging Face.

I followed the diffusers docs for training a text-to-image LoRA on Stable-Diffusion-v1-5 and trained on a 16 GB GPU for over 7 hours, but after running inference I find that the generated outputs are very distorted and low quality.
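For reference, the training invocation followed the docs' `train_text_to_image_lora.py` example; it was something close to this (exact step count and learning rate from memory, so treat them as approximate):

```shell
accelerate launch train_text_to_image_lora.py \
  --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
  --dataset_name="fw1zr/rahul-gandhi-captions" \
  --resolution=512 \
  --train_batch_size=1 \
  --max_train_steps=15000 \
  --learning_rate=1e-4 \
  --lr_scheduler="cosine" \
  --checkpointing_steps=500 \
  --validation_prompt="photo of rahul gandhi, smiling" \
  --output_dir="rahul-gandhi-lora"
```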

prompt: photo of rahul gandhi, smiling, beard look, wearing glasses, speaking, with one hand up

Here is the script I used for inference:

import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler


model_base = "runwayml/stable-diffusion-v1-5"
pipe = StableDiffusionPipeline.from_pretrained(model_base, torch_dtype=torch.float16, use_safetensors=True)

pipe.unet.load_attn_procs("BootesVoid/rahul-gandhi-lora")
pipe.to("cuda")


generator = torch.Generator("cuda").manual_seed(17677)
image = pipe(
    "photo of rahul gandhi, walking",
    num_inference_steps=100,
    guidance_scale=7.5,
    generator=generator,
    cross_attention_kwargs={"scale": 0.7},
).images[0]
image

The model can be found here: BootesVoid/rahul-gandhi-lora · Hugging Face

How do I get this model to produce high-quality, photorealistic output? Do I have to switch to SDXL for fine-tuning, or add some sort of upscaler to the pipeline? Or am I not running inference correctly?