Guidance on Training Stable Diffusion Models for Image Generation with Multiple Object Categories

Hi Community,

I am currently working on training a Stable Diffusion model for image generation, and I have encountered a few challenges that I believe your expertise can help me overcome.

I am in the process of training a Stable Diffusion model using a unique dataset. This dataset consists of images grouped into multiple categories, with each category containing only a single image. Importantly, these images do not have any added noise, and I have also provided captions for each image. The training is being done with DreamBooth.

One of the initial obstacles I encountered was GPU memory exhaustion due to the substantial dataset size. As a workaround, I decided to train the model on a smaller subset of the dataset.

Challenges and Issues:

  • Generating images from the model works relatively well for a single category of objects, but problems arise when attempting to combine objects from two different categories. The combined results are not satisfactory, and the generated images often exhibit cropping, displaying only a fraction of the intended scene instead of the complete room.

I am seeking guidance on improving the results when combining objects from different trained categories in a single generated image. The current outcomes do not meet my expectations.

I am struggling to generate complete room images rather than partially cropped images. The issue lies in the generated images consistently displaying only a portion of the desired scene.

My primary objective is to enhance the quality of generated images when working with a single object category. I aim to produce images that closely resemble the objects on which the model was originally trained.

I would greatly appreciate your insights, recommendations, and strategies related to:

  • Any necessary adjustments to hyperparameters to enhance the model’s performance.
  • Augmentation techniques or data preprocessing methods that could potentially improve results.
  • Suggestions for modifying the training process to better accommodate single and combined object categories.
  • Insights or techniques for generating complete room images successfully.

Thank you in advance for your help.

What worked for me was doing a 3-concept training covering each concept I wanted to merge. For instance, I wanted to make photos of myself, Chad, and my fiancée, Courtney, in different famous places, and since the famous places are already pretty much trained into the base model, I just needed to put our likenesses into the new model I was training. But when I did the concept for her, then me, in the same training, it wouldn’t combine us properly. It would be a horrific combination of the two of us, or her with my facial hair, or me with her body… shudder, no thank you!

My workaround, which happened to be really, really great, works almost every time, flawlessly, with little to no unexpected or unwanted results. It still maintains its ability to change our appearances (e.g. both with silly mustaches, both wearing formal wear, etc.), and can still render just one, or both, of us. The trick is a concept list with 3 concepts, which were:

“photo of Chad” instance and “photo of a man” for class
“photo of Courtney” instance and “photo of a lady” for class
and “photo of Chad and Courtney” instance and “photo of a couple” for class.
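In the diffusers-based DreamBooth Colab notebooks, a multi-concept setup like this is usually expressed as a `concepts_list` of dicts serialized to JSON. A minimal sketch of the three concepts above (the directory paths here are placeholders I made up, and the exact key names may differ between notebook forks):

```python
import json

# Hypothetical sketch: three DreamBooth concepts, each pairing an
# instance prompt with a class prompt for prior preservation.
# Directory paths are placeholder assumptions.
concepts_list = [
    {
        "instance_prompt": "photo of Chad",
        "class_prompt": "photo of a man",
        "instance_data_dir": "./data/chad",
        "class_data_dir": "./data/man",
    },
    {
        "instance_prompt": "photo of Courtney",
        "class_prompt": "photo of a lady",
        "instance_data_dir": "./data/courtney",
        "class_data_dir": "./data/lady",
    },
    {
        "instance_prompt": "photo of Chad and Courtney",
        "class_prompt": "photo of a couple",
        "instance_data_dir": "./data/chad_and_courtney",
        "class_data_dir": "./data/couple",
    },
]

# The notebooks typically read this from a JSON file.
with open("concepts_list.json", "w") as f:
    json.dump(concepts_list, f, indent=4)
```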

I did 20 photos each for the instance images, and 2,000 per concept (so 6,000 total) for the class images. I trained that at a learning rate of 1e-6, constant schedule, 0 warm-up steps, and let it train one-to-one with the class images, meaning I did 6,000 steps. I saved a copy of the diffusers model every 500 steps, because I wanted to see where the best spot was. Turns out that 6,000, on the nose, was the best.
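For reference, those hyperparameters map roughly onto the flags of the diffusers `train_dreambooth.py` example script. This is a hedged sketch, not the exact command used above: the model ID, paths, and output directory are placeholder assumptions, and the stock diffusers script takes a single instance/class pair (the multi-concept `concepts_list` setup needs one of the forked notebook scripts):

```shell
# Sketch of a single-concept DreamBooth run with the hyperparameters
# described above. Paths and model ID are placeholders.
accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
  --instance_data_dir="./data/chad" \
  --class_data_dir="./data/man" \
  --instance_prompt="photo of Chad" \
  --class_prompt="photo of a man" \
  --with_prior_preservation \
  --num_class_images=2000 \
  --learning_rate=1e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=6000 \
  --checkpointing_steps=500 \
  --output_dir="./dreambooth-output"
```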

I’ve done one of these on the Automatic1111 webui DreamBooth extension, as well as on Google Colab. Both came out very similar. For the base model, I usually use any photorealistic model I can find on Hugging Face that is already laid out as a diffusers model, like the Google Colab requires, because I’m too lazy to convert my favorite safetensors to diffusers, upload my own model, and use that as the reference. The Automatic1111 webui makes that part easier, but if you don’t label your 1.5, 2.0, 2.1, and XL models appropriately, and note whether they’re fp16, bf16, or full/float models, it quickly gets confusing as to why your training keeps failing because your shapes are different and won’t match.
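On the "too lazy to convert" point: the diffusers repository ships a conversion script that turns a single-file checkpoint into a diffusers folder layout. A hedged sketch (file names and paths are placeholders):

```shell
# Convert a single-file safetensors checkpoint into a diffusers
# directory layout using the script from the diffusers repo.
# Checkpoint and output paths are placeholders.
python scripts/convert_original_stable_diffusion_to_diffusers.py \
  --checkpoint_path ./my_favorite_model.safetensors \
  --from_safetensors \
  --dump_path ./my_favorite_model_diffusers
```

Newer diffusers versions can also load single-file checkpoints directly via `StableDiffusionPipeline.from_single_file`, which may spare you the conversion step entirely.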

I don’t know what the objects you’re training are, or whether they’re people, but that’s how I solved my problem. When I prompt with the above-mentioned model, I can use “photo of chad” or “photo of courtney” or “photo of chad standing next to a ___ with courtney standing next to a ____”, and it works remarkably well every single time.

Hope this helps.

Hi @chchchadzilla, I think your approach is what I have been looking for, and for some reason I can’t find anything on it. Is it possible for you to share the training config/file as a starter example? I think the Colab notebook might be even easier; I could replicate it locally or use Colab.
I had a few questions regarding the training of the model:

  1. Did you have one model as a base and use 3 instances to train simultaneously?
  2. Did you use separate images for each instance? For example, images of you alone for Chad, and so on?
  3. Did you use any regularization images for each instance type? For example, photos of women for Courtney, photos of men for yourself, and photos of couples for the shared images?

I really appreciate your help. I know I am asking for a lot here.