Guidance on Training Stable Diffusion Models for Image Generation with Multiple Object Categories

Hi Community,

I am currently working on training a Stable Diffusion model for image generation, and I have encountered a few challenges that I believe your expertise can help me overcome.

I am in the process of training a Stable Diffusion model using a unique dataset. This dataset consists of images grouped into multiple categories, with each category containing only a single image. Importantly, these images do not have any added noise, and I have also provided captions for each image. The training is being done with DreamBooth.

One of the initial obstacles I encountered was GPU memory exhaustion due to the substantial dataset size. As a workaround, I decided to train the model on a smaller subset of the dataset.

Challenges and Issues:

  • Generating images from the model works relatively well for a single category of objects, but problems arise when attempting to combine objects from two different categories. The combined results are not satisfactory, and the generated images often exhibit cropping, displaying only a fraction of the intended scene instead of the complete room.

I am seeking guidance on improving the results when combining objects from different trained categories in a single generated image. The current outcomes do not meet my expectations.

I am struggling to generate complete room images rather than partially cropped images. The issue lies in the generated images consistently displaying only a portion of the desired scene.

My primary objective is to enhance the quality of generated images when working with a single object category. I aim to produce images that closely resemble the objects on which the model was originally trained.

I would greatly appreciate your insights, recommendations, and strategies related to:

  • Any necessary adjustments to hyperparameters to enhance the model’s performance.
  • Augmentation techniques or data preprocessing methods that could potentially improve results.
  • Suggestions for modifying the training process to better accommodate single and combined object categories.
  • Insights or techniques for generating complete room images successfully.

Thank you in advance for your help.

What worked for me was doing a 3-concept training covering each concept I wanted to merge. For instance, I wanted to make photos of myself, Chad, and my fiancée, Courtney, in different famous places, and since the famous places are already pretty much trained into the base model, I just needed to put our likenesses into the new model I was training. But when I did the concept for her, then me, in the same training, it wouldn’t combine us properly. It would be a horrific combination of the two of us, or her with my facial hair, or me with her body… shudder, no thank you!

My workaround, which happened to be really, really great, works almost every time, flawlessly, with little to no unexpected or unwanted results. It still maintains its ability to change our appearances (e.g. both with silly mustaches, both wearing formal wear, etc.), and can still render just one, or both, of us. The trick is a concept list with 3 concepts, which were:

“photo of Chad” instance and “photo of a man” for class
“photo of Courtney” instance and “photo of a lady” for class
and “photo of Chad and Courtney” instance and “photo of a couple” for class.
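In the diffusers-based DreamBooth Colab notebooks, a multi-concept setup like this is usually expressed as a `concepts_list` of dicts serialized to JSON. A minimal sketch of the three concepts above (the directory paths here are placeholders I made up, and the exact key names may differ between notebook forks):

```python
import json

# Hypothetical sketch: three DreamBooth concepts, each pairing an
# instance prompt with a class prompt for prior preservation.
# Directory paths are placeholder assumptions.
concepts_list = [
    {
        "instance_prompt": "photo of Chad",
        "class_prompt": "photo of a man",
        "instance_data_dir": "./data/chad",
        "class_data_dir": "./data/man",
    },
    {
        "instance_prompt": "photo of Courtney",
        "class_prompt": "photo of a lady",
        "instance_data_dir": "./data/courtney",
        "class_data_dir": "./data/lady",
    },
    {
        "instance_prompt": "photo of Chad and Courtney",
        "class_prompt": "photo of a couple",
        "instance_data_dir": "./data/chad_and_courtney",
        "class_data_dir": "./data/couple",
    },
]

# The notebooks typically read this from a JSON file.
with open("concepts_list.json", "w") as f:
    json.dump(concepts_list, f, indent=4)
```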

I did 20 photos each for the instance images, and 2,000 per concept (so 6,000 total) for the class images. I trained that at a learning rate of 1e-6, constant schedule, 0 warm-up steps, and let it train one-to-one with the class images, meaning I did 6,000 steps. I saved a copy of the diffusers model every 500 steps, because I wanted to see where the best spot was. Turns out that 6,000, on the nose, was the best.
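For reference, those hyperparameters map roughly onto the flags of the diffusers `train_dreambooth.py` example script. This is a hedged sketch, not the exact command used above: the model ID, paths, and output directory are placeholder assumptions, and the stock diffusers script takes a single instance/class pair (the multi-concept `concepts_list` setup needs one of the forked notebook scripts):

```shell
# Sketch of a single-concept DreamBooth run with the hyperparameters
# described above. Paths and model ID are placeholders.
accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
  --instance_data_dir="./data/chad" \
  --class_data_dir="./data/man" \
  --instance_prompt="photo of Chad" \
  --class_prompt="photo of a man" \
  --with_prior_preservation \
  --num_class_images=2000 \
  --learning_rate=1e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=6000 \
  --checkpointing_steps=500 \
  --output_dir="./dreambooth-output"
```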

I’ve done one of these on the Automatic1111 webui DreamBooth extension, as well as on Google Colab. Both came out very similar. For the base model, I usually use any photorealistic model I can find on Hugging Face that is already laid out as a diffusers model, like the Google Colab requires, because I’m too lazy to convert my favorite safetensors to diffusers, upload my own model, and use that as the reference. The Automatic1111 webui makes that part easier, but if you don’t label your 1.5, 2.0, 2.1, and XL models appropriately, and note whether they’re fp16, bf16, or full/float models, it quickly gets confusing as to why your training keeps failing because your shapes are different and won’t match.
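On the "too lazy to convert" point: the diffusers repository ships a conversion script that turns a single-file checkpoint into a diffusers folder layout. A hedged sketch (file names and paths are placeholders):

```shell
# Convert a single-file safetensors checkpoint into a diffusers
# directory layout using the script from the diffusers repo.
# Checkpoint and output paths are placeholders.
python scripts/convert_original_stable_diffusion_to_diffusers.py \
  --checkpoint_path ./my_favorite_model.safetensors \
  --from_safetensors \
  --dump_path ./my_favorite_model_diffusers
```

Newer diffusers versions can also load single-file checkpoints directly via `StableDiffusionPipeline.from_single_file`, which may spare you the conversion step entirely.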

I don’t know what the objects you’re training are, or whether they’re people, but that’s how I solved my problem. When I prompt with the above-mentioned model, I can use “photo of chad” or “photo of courtney” or “photo of chad standing next to a ___ with courtney standing next to a ____”, and it works remarkably well every single time.

Hope this helps.

Hi @chchchadzilla, I think your approach is what I have been looking for, and for some reason I can’t find anything on it. Is it possible for you to share the training config/file as a starter example? I think the Colab notebook might be even easier; I could replicate it locally or use Colab.
I had a few questions regarding the training of the model:

  1. Did you have one model as a base and use 3 instances to train simultaneously?
  2. Did you use separate images for each instance? For example, images of you alone for Chad, and so on?
  3. Did you use any regularization images for each instance type? For example, photos of women for Courtney, photos of men for yourself, and photos of couples for the shared images?

I really appreciate your help. I know I am asking for a lot here.