Is there a specific generative model to describe User Interfaces?

There are a lot of generative image-to-text models hosted on Hugging Face.
But is there any model specifically for describing user interfaces?
All the models I checked are general-purpose, i.e. they recognize commonly used objects rather than domain-specific ones.
I want the model to describe a UI given a screenshot. E.g., for the screenshot below the model would write something like: “There are menu options: File, Edit, View, and some groups of icons: Selection, Image, Tools, Brushes.” Or it could be without text recognition: “The upper and lower parts are light blue; the middle part is almost white and contains a lot of icons.”


Like these?


@John6666 Thanks, yes, exactly.
Is there something ready to use? I read the ScreenAI paper but can’t find a ScreenAI model on HF. The second model, from Xiaomi, also looks cool, but the GitHub description suggests it needs to be trained before use.
The ideal would be a ready-to-use model that could be downloaded from Hugging Face.


Hmm… I can’t find UI image recognition models that are ready to use with Hugging Face…

Is a general-purpose VLM that accepts prompts not good enough? It can’t be used for very precise applications, though.
The example below is Qwen 2.5 VL 32B, but Aya Vision also performs very well, and Florence 2 and PaliGemma 2 are well-known smaller options. If you’re looking for LLM-level performance, try LLaVA.
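For reference, here is a minimal sketch of how one might prompt a Qwen 2.5 VL checkpoint with the transformers library to describe a UI screenshot. The model ID, prompt wording, and helper names are my own choices, not something from this thread; I use the 7B variant since 32B is heavy to run locally, and your screenshot path would replace the placeholder.

```python
def build_messages(image_path: str) -> list:
    """Build the chat-format payload asking the VLM to describe a UI screenshot."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {
                    "type": "text",
                    "text": (
                        "Describe this user interface: list the menu options, "
                        "toolbars, and groups of icons you can see."
                    ),
                },
            ],
        }
    ]


def describe_ui(image_path: str, model_id: str = "Qwen/Qwen2.5-VL-7B-Instruct") -> str:
    # Heavy imports kept local so that building messages stays lightweight.
    from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(model_id)

    # The processor's chat template handles both the text and the image.
    inputs = processor.apply_chat_template(
        build_messages(image_path),
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)

    out = model.generate(**inputs, max_new_tokens=256)
    # Strip the prompt tokens, keep only the generated description.
    trimmed = out[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(trimmed, skip_special_tokens=True)[0]
```

Usage would be something like `print(describe_ui("screenshot.png"))`; swapping `model_id` lets you try the 32B checkpoint or another VLM with the same chat-template interface.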


Yes, that’s what I need. I’ll try these models. Thanks a lot!