I am trying to find a single LLM that can accept both images and text in the prompt.
Models like LLaVA-13B and Fuyu-8B expect image + text in every prompt (check here). The prompt looks like this:
“< image >\nUSER: What’s the content of the image?\nASSISTANT:”
Models like Llama and Mistral expect only text in every prompt.
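To make the difference concrete, here is a rough sketch in plain Python (no model calls) of the two prompt shapes. `build_prompt` and `IMAGE_TOKEN` are names I made up for illustration. The text-only shape is simply the same USER/ASSISTANT template without the image placeholder, which is an assumption on my part and not the actual Llama or Mistral chat template.

```python
# Hypothetical helper, for illustration only: it builds the prompt string
# and does not load or call any model.
IMAGE_TOKEN = "<image>"  # image placeholder from the LLaVA-style prompt quoted above


def build_prompt(question: str, has_image: bool) -> str:
    """Return a USER/ASSISTANT prompt, adding the image placeholder only when an image is attached."""
    prefix = f"{IMAGE_TOKEN}\n" if has_image else ""
    return f"{prefix}USER: {question}\nASSISTANT:"


# One turn with an image, one turn with text only.
with_image = build_prompt("What is the content of the image?", has_image=True)
text_only = build_prompt("Summarize our chat so far.", has_image=False)
```

What I am after is one model that handles both `with_image` and `text_only` style turns within the same conversation.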
Building two separate modules in the pipeline (one for image + text, one for text only) might lose context when the user mixes questions and images in a chat. So I am looking for a single LLM.
In my exploration I found NExT-GPT, which can chat with an image in some turns and with text only in others. But I couldn't find an integration for it in HF.
Any help from the forum would be much appreciated. Tagging @sgugger @nielsr @sayakpaul for a quick response.
Thank you so much in advance.