Input image and a question about the image and get a result


Hello everyone,

I’m interested in developing software where I can input an image of a bathroom and then ask questions like “What condition is the bathroom in?” and “Which decade does the bathroom appear to be from?” I’ve tried using the sceneExplain plugin with ChatGPT, but the results have been off: it describes 40-year-old, worn-out bathrooms as being in great condition.

I have a decent background in programming, so I believe I might need to train a model myself. However, I’m unsure about which categories of models are best suited for this purpose and which ones I can train on my own.

Can anyone provide guidance on the best models for this task, or perhaps link me to a tutorial on how to train them?


You’re in luck, because Hugging Face released a model today that can do just that. The model is called IDEFICS, and it can be seen as a ChatGPT-like model that can also take arbitrary sequences of images as input.

Here’s an example on a random bathroom image:

Do note that IDEFICS is pretty large (there are 2 sizes, 9 billion and 80 billion parameters). You can also train much smaller models to do this, like ViLT, BLIP or InstructBLIP.
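
If you want to try one of the smaller models before training anything, here is a minimal sketch using the `visual-question-answering` pipeline from the `transformers` library with a ViLT checkpoint. The checkpoint name `dandelin/vilt-b32-finetuned-vqa`, the helper names `load_vqa` and `ask_about_image`, and the file name `bathroom.jpg` are my own choices for illustration, not something from the original question. Keep in mind that this checkpoint picks answers from a fixed vocabulary learned on general VQA data, so for questions about condition or decade you would most likely still need to fine-tune on your own labeled bathroom photos.

```python
from typing import Callable, Dict, List

# The two questions from the original post.
QUESTIONS: List[str] = [
    "What condition is the bathroom in?",
    "Which decade does the bathroom appear to be from?",
]


def load_vqa(model_name: str = "dandelin/vilt-b32-finetuned-vqa") -> Callable:
    """Load a visual-question-answering pipeline (downloads weights on first use).

    Assumes `pip install transformers torch pillow`. The import is done lazily
    so the rest of this file can be used without those packages installed.
    """
    from transformers import pipeline

    return pipeline("visual-question-answering", model=model_name)


def ask_about_image(image, questions: List[str], vqa: Callable) -> Dict[str, str]:
    """Ask every question about one image and keep the top answer for each.

    `image` can be a local path, a URL, or a PIL image. `vqa` is any callable
    with the pipeline's interface: it returns a list of dicts that each hold
    an "answer" and a "score", best answer first.
    """
    answers: Dict[str, str] = {}
    for question in questions:
        predictions = vqa(image=image, question=question, top_k=1)
        answers[question] = predictions[0]["answer"]
    return answers


# Example usage (hypothetical file name; uncomment to run):
# vqa = load_vqa()
# for question, answer in ask_about_image("bathroom.jpg", QUESTIONS, vqa).items():
#     print(question, "->", answer)
```

Passing the pipeline in as an argument means you can swap in a different checkpoint, or your own fine-tuned model, without touching the rest of the code.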