If you need the objects in the generated image to stay reasonably consistent, a technique along the lines of Virtual Try-On may be a good fit, because diffusion models such as SD and Flux inevitably leave some areas ambiguous when used on their own. Another option is a specialized ControlNet such as Flux Edit.