People have been using DreamBooth to teach Stable Diffusion foreground subjects or faces it doesn't know (DreamBooth fine-tuning example). But is it possible to train a DreamBooth model for a specific background scene, like a studio?
I ran the diffusers example script and it worked well for foreground subjects; I also see some examples for specific styles, but I can’t get any results for background scenes.
Following the tutorial I used:
- SD-2.1 to create ~100 class images with prompts like "first person view of a studio".
- 3-5 pictures of the specific background as instance images, with prompts like "first person view of a [sks] studio". I also tried instance images that include a random foreground subject.
- Learning rates ranging from 5e-7 up to 5e-5.
The loss oscillates between 0.4 and 0.8 and never goes down. At inference, the generated images don’t look anything like the provided instance pictures.
Is there a way to impart specific backgrounds into SD? The objective is to keep the background coherent across multiple generated images, while also benefiting from details like projected shadows and reflections, which would be lost by just compositing a masked foreground onto a fixed background.
@Rapot If you are specifically editing the background, I would use CLIPSeg to mask the background and then pass the mask to a Stable Diffusion model to generate studio backgrounds.
So the logic would look like this for generating masks:
import requests
import torch
from diffusers import StableDiffusionInpaintPipeline
from transformers import AutoProcessor, CLIPSegForImageSegmentation
from PIL import Image
from IPython.display import display
from torchvision import transforms
import matplotlib.pyplot as plt
assert torch.cuda.is_available(), "No GPU found!"
processor = AutoProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined").to('cuda') # Move the model to GPU
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = ["a cat", "a remote", "a blanket"]
inputs = processor(text=texts, images=[image] * len(texts), padding=True, return_tensors="pt").to('cuda')
outputs = model(**inputs)
logits = outputs.logits
_, ax = plt.subplots(1, len(texts) + 1, figsize=(3*(len(texts) + 1), 4))
[a.axis('off') for a in ax.flatten()]
ax[0].imshow(image) # Show the original image in the first panel
[ax[i+1].imshow(torch.sigmoid(outputs.logits[i].detach().cpu()), cmap='gray') for i in range(len(texts))] # Move the data to CPU for visualization
[ax[i+1].text(0, -15, texts[i]) for i in range(len(texts))]
Then you can select the generated masks:
# Selecting a mask
mask_number = 2
selected_mask = torch.sigmoid(outputs.logits[mask_number]).unsqueeze(0).cpu() # Move the data to CPU for visualization and further processing
stable_diffusion_mask = transforms.ToPILImage()(selected_mask)
# Detach the tensor from the computation graph and convert to NumPy for visualization
selected_mask_np = selected_mask.detach().cpu().numpy()
plt.imshow(selected_mask_np.squeeze(), cmap='gray') # Squeeze to remove the channel dimension
plt.axis('off') # Hide the axis
Then you can use Stable Diffusion inpainting to fill in the mask for you. You can refer to the following documentation:
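To make the inpainting step concrete, here is a rough sketch rather than a tested recipe. It assumes the `stabilityai/stable-diffusion-2-inpainting` checkpoint and a CUDA GPU; the helper names `to_inpaint_mask` and `inpaint_background`, the 512x512 size and the threshold of 128 are my own choices for illustration, not something from the docs. The inpainting pipeline repaints white mask pixels and keeps black ones, so if the CLIPSeg prompt describes the foreground subject, the mask has to be inverted before the background is regenerated.

```python
from PIL import Image, ImageOps


def to_inpaint_mask(soft_mask, size=(512, 512), threshold=128, invert=False):
    """Turn a soft grayscale mask into a hard black/white inpainting mask.

    White (255) pixels get repainted by the pipeline, black (0) pixels are kept.
    """
    mask = soft_mask.convert("L").resize(size)
    # Binarize: pixels at or above the threshold become white, the rest black
    mask = mask.point(lambda p: 255 if p >= threshold else 0)
    # Invert when the soft mask marks the subject but the background should be repainted
    return ImageOps.invert(mask) if invert else mask


def inpaint_background(image, soft_mask, prompt, mask_marks_subject=True,
                       model_id="stabilityai/stable-diffusion-2-inpainting"):
    """Repaint the background of `image`. Needs a CUDA GPU and downloads the checkpoint."""
    # Heavy imports are kept local so the mask helper can be used without them
    import torch
    from diffusers import StableDiffusionInpaintPipeline

    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        model_id, torch_dtype=torch.float16
    ).to("cuda")
    mask = to_inpaint_mask(soft_mask, invert=mask_marks_subject)
    init = image.convert("RGB").resize(mask.size)
    return pipe(prompt=prompt, image=init, mask_image=mask).images[0]
```

With the variables from the snippets above, a call would look like `inpaint_background(image, stable_diffusion_mask, "first person view of a studio")`, provided the selected mask covers the subject you want to keep. If the CLIPSeg prompt already describes the background, pass `mask_marks_subject=False`.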
Thank you for your reply.
Wouldn’t this generate a slightly different background every time, even with the same prompt, when using different foreground subjects?
I’m hoping to teach SD a specific background that it generates every time, like one can do for faces with DreamBooth or Textual Inversion.