How to make a model respect static UI layout and generate only overlay characters (ControlNet / SDXL / IPAdapter?)

Hi everyone! :waving_hand:
We’re building a Web3 customization tool and are currently working on an AI pipeline where the model should:

  • Understand that the UI layout in the center (a wallet login screen) is not to be redrawn
  • Generate only one object or character that interacts with the interface (e.g., leans toward a button, sits beside it, etc.)
  • Return a transparent PNG, without adding a background or modifying the UI
  • Ideally support prompt + guide image + layout-awareness

:package: What we already tried:

:white_check_mark: We created a full JSON representation of the wallet layout, including positions, button labels, sizes, safe zones, and colors.
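
For reference, here is roughly what that layout JSON looks like (abridged, with illustrative field names and values rather than our exact schema):

```json
{
  "canvas": { "width": 1024, "height": 1024 },
  "ui": {
    "bounds": { "x": 312, "y": 180, "width": 400, "height": 640 },
    "elements": [
      { "label": "Unlock", "type": "button", "x": 470, "y": 490, "width": 84, "height": 36 }
    ],
    "safeZones": [
      { "x": 312, "y": 180, "width": 400, "height": 640 }
    ],
    "colors": { "background": "#1C1C28", "accent": "#AB9FF2" }
  }
}
```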

:white_check_mark: We also generate a guide image of the UI as a reference (Phantom Wallet login mockup)

:white_check_mark: We built a promptBuilder.ts that merges the following (sketched after this list):

  1. Hard-coded constraints (“Do not cover the interface”, etc.)
  2. The layout as descriptive text (“Unlock button at x:470, y:490”)
  3. User prompt (e.g., “Pepe touches the unlock button”)
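
The merge itself is simple; here is a Python sketch of what promptBuilder.ts does (names and file paths are illustrative, and the layout schema is the abridged one above):

```python
# Rough Python equivalent of promptBuilder.ts (illustrative, not our exact code).
import json

HARD_CONSTRAINTS = [
    "Do not cover or redraw the interface.",
    "Do not add a background; output only the new character.",
]

def describe_layout(layout: dict) -> str:
    """Turn the layout JSON into short descriptive sentences."""
    parts = [
        f'{el["label"]} {el["type"]} at x:{el["x"]}, y:{el["y"]} ({el["width"]}x{el["height"]} px)'
        for el in layout["ui"]["elements"]
    ]
    return "UI elements: " + "; ".join(parts) + "."

def build_prompt(layout: dict, user_prompt: str) -> str:
    """Merge hard constraints, the layout description, and the user's request."""
    return " ".join([*HARD_CONSTRAINTS, describe_layout(layout), user_prompt])

if __name__ == "__main__":
    with open("wallet_layout.json") as f:
        layout = json.load(f)
    print(build_prompt(layout, "Pepe touches the unlock button"))
```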

:white_check_mark: Then we tested:

  • lucataco/sdxl-controlnet (:warning: now returns 404)
  • stabilityai/stable-diffusion-xl-base-1.0 via the Hugging Face API
  • IPAdapter in local pipelines (see the diffusers sketch after this list)
  • ComfyUI to build manual graph workflows
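
For the IPAdapter attempt, our local setup is roughly the following (a minimal diffusers sketch; the model IDs, guide-image path, and scale value are examples, not a recommendation):

```python
# Minimal SDXL + IP-Adapter sketch with diffusers (paths and IDs are illustrative).
import torch
from diffusers import StableDiffusionXLPipeline
from diffusers.utils import load_image

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Load IP-Adapter weights for SDXL and set how strongly the guide image steers the output.
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models",
                     weight_name="ip-adapter_sdxl.bin")
pipe.set_ip_adapter_scale(0.6)

guide = load_image("phantom_wallet_mockup.png")  # the UI guide image
image = pipe(
    prompt="Pepe the frog leaning toward a button, sticker style",
    ip_adapter_image=guide,
    num_inference_steps=30,
).images[0]
image.save("character.png")
```

This keeps the style close to the guide image, but (as noted under the issues below) it gives no spatial guarantee about the Unlock button’s area.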

:cross_mark: Issues we face:

  • Most models tend to redraw the UI layout, even when told not to
  • A background often reappears (even when the prompt asks for transparency)
  • Character generation isn’t aware of UI boundaries (like “don’t cover the Unlock button”)
  • IPAdapter respects style, but lacks fine-grained interaction control

:bullseye: Our ideal model:

We’re looking for a model (or combo) that can:

  • Accept an image + prompt + an optional JSON layout or mask
  • Draw only the new character (no background, no UI duplication)
  • Ideally support a ControlNet-style mask or other fine spatial constraints (we can derive one from our layout JSON, as sketched below)
  • Return a PNG with transparency
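
One thing we can already produce from the layout JSON is a mask: white where the model may paint, black over the UI/safe zones. A Pillow sketch, assuming the abridged schema shown earlier:

```python
# Build an inpainting mask from the layout JSON: white = may paint, black = keep untouched.
import json
from PIL import Image, ImageDraw

with open("wallet_layout.json") as f:
    layout = json.load(f)

w, h = layout["canvas"]["width"], layout["canvas"]["height"]
mask = Image.new("L", (w, h), 255)          # start fully paintable
draw = ImageDraw.Draw(mask)

for zone in layout["ui"]["safeZones"]:      # black out every protected region
    draw.rectangle(
        [zone["x"], zone["y"], zone["x"] + zone["width"], zone["y"] + zone["height"]],
        fill=0,
    )

# This could be tightened further, e.g. leave white only where the character should appear.
mask.save("inpaint_mask.png")
```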

:handshake: What we’d love from the community:

  • Any suggestions for models or pipelines that could help?
  • Has anyone tried layout-aware generation like this?
  • Would custom ControlNet training or a DreamBooth variant help here?

We’re happy to share more screenshots or JSON layouts if needed.

Thanks in advance — this forum has been super helpful for us so far :folded_hands:

In terms of prompt comprehension, FLUX comes to mind. It might also be worth trying inpainting (minimal sketch below). If you’re building your own pipeline, VTON might be a similar approach.
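
For the inpainting route, a minimal sketch with the public SDXL inpainting checkpoint could look like the following. The mask is the kind of layout-derived mask described above (white = paintable, black = protected UI); since the diffusion model itself only outputs RGB, a background-removal pass such as rembg (or diffing against the original UI image) is one heuristic way to get the transparent overlay afterwards. Model IDs, file names, and parameter values are illustrative:

```python
# Masked inpainting over the UI guide image, then background removal for transparency.
import torch
from diffusers import AutoPipelineForInpainting
from diffusers.utils import load_image
from rembg import remove

pipe = AutoPipelineForInpainting.from_pretrained(
    "diffusers/stable-diffusion-xl-1.0-inpainting-0.1",
    torch_dtype=torch.float16, variant="fp16",
).to("cuda")

ui = load_image("phantom_wallet_mockup.png").resize((1024, 1024))
mask = load_image("inpaint_mask.png").resize((1024, 1024))   # white = paint, black = keep UI

result = pipe(
    prompt="Pepe the frog leaning toward the unlock button, sticker style",
    image=ui,
    mask_image=mask,
    strength=0.99,              # repaint the masked area almost completely
    num_inference_steps=30,
).images[0]

# Pixels under the black mask stay as they were, so the UI is not redrawn.
# rembg should treat the flat UI/background as background and keep the character as foreground.
overlay = remove(result)        # returns an RGBA PIL image
overlay.save("character_overlay.png")
```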