Looking for Model Architecture for Instant Multi-Layer Virtual Try-On (Avatar + Outfit Layering)

Hi everyone,

We’re building an instant virtual try-on application where users can:

  • Upload their own avatar (full-body image)

  • Separately upload different outfit items such as t-shirts, jackets, hoodies, etc.

  • Then try them on instantly, with proper multi-layering support (for example, a jacket correctly layering over a t-shirt).

We’ve tested several VTO and segmentation models from Hugging Face and other sources, but so far none of them support proper real-time layering between multiple clothing items and the human body.

Our main goals are:

  • Realistic and accurate multi-layer garment compositing

  • Instant processing speed (ideally within 1–5 seconds per try-on)

  • Separate segmentation of the human and each clothing item

  • Natural blending of layers (maintaining texture, folds, and depth)
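
To make the compositing goal concrete, here is a minimal sketch of the final blending step as we picture it. It assumes each garment has already been segmented and warped onto the avatar as an RGBA layer of the same size. The function name `composite_layers` and the toy arrays are illustrative only; this is plain back-to-front alpha blending in NumPy and does not handle occlusion by arms or hair, folds, or depth.

```python
import numpy as np

def composite_layers(base_rgb, layers):
    """Alpha-composite RGBA garment layers over a base image, back to front.

    base_rgb: float array (H, W, 3) in [0, 1], the avatar image.
    layers:   list of float arrays (H, W, 4) in [0, 1], innermost garment first
              (e.g. t-shirt, then jacket). Hypothetical input format.
    """
    out = base_rgb.astype(np.float64).copy()
    for layer in layers:
        rgb, alpha = layer[..., :3], layer[..., 3:4]
        # Standard "over" operator: the new layer covers what is beneath it
        # in proportion to its alpha.
        out = alpha * rgb + (1.0 - alpha) * out
    return out

# Toy 2x2 example: black avatar, red t-shirt everywhere,
# blue jacket covering only the left column.
avatar = np.zeros((2, 2, 3))
tshirt = np.zeros((2, 2, 4))
tshirt[..., 0] = 1.0   # red channel
tshirt[..., 3] = 1.0   # fully opaque
jacket = np.zeros((2, 2, 4))
jacket[:, 0, 2] = 1.0  # blue channel, left column
jacket[:, 0, 3] = 1.0  # opaque only in the left column

result = composite_layers(avatar, [tshirt, jacket])
```

In this toy case the left column ends up blue (jacket over t-shirt) and the right column stays red (t-shirt only). The hard part we are asking about is producing those warped, correctly masked layers in the first place.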

We’d love some guidance on:

  1. What model architecture or pipeline we should follow for this kind of instant layering try-on system

  2. Any pretrained models or open-source frameworks that already support multi-layer virtual try-on

  3. Whether we should combine human parsing + clothing segmentation + warping models, and if so, how to structure that

Any advice, references, or model suggestions would be hugely appreciated.
