What type of model should I use to combine 4 album covers into a coherent output image?

Hey guys, I’m working on a project that takes 4 album covers and tries to create an output that mixes all 4 of them, drawing on their visual elements, art style, colour scheme and so on. Up to this point I’ve been using text-to-image with little success. I believe a lot of album covers aren’t in the training data for text-to-image models, so the AI does a lot of guesswork and ends up with a cluttered image whose influences are very unclear. Because of this, I think an image-to-image model might work better: I could feed it the 4 album covers directly and ask it to mix them via some sort of text prompt. I’ve looked across the web but haven’t been able to find anything like this. I’m hoping you guys can point me in the right direction. Maybe there are some Hugging Face models I haven’t seen that can be used via an API. Thank you!