Multimodal LLM with Image and Text sequentially in its prompt

I am trying to find a single LLM that can accept both images and text in the prompt.

Models like LLaVA-13B and Fuyu-8B expect image + text in every prompt. The prompt looks like:

“<image>\nUSER: What’s the content of the image?\nASSISTANT:”
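For reference, LLaVA-1.5-style models use a literal `<image>` placeholder in the prompt string, which the processor later replaces with the vision embeddings. A minimal sketch of building that prompt (the helper name is mine, not part of any library):

```python
def build_llava_prompt(user_text: str, has_image: bool) -> str:
    """Build a LLaVA-1.5-style prompt string.

    LLaVA expects the literal `<image>` placeholder wherever the image
    features should be spliced in; a text-only turn simply omits it.
    """
    image_token = "<image>\n" if has_image else ""
    return f"{image_token}USER: {user_text}\nASSISTANT:"
```

For example, `build_llava_prompt("What's the content of the image?", has_image=True)` reproduces the prompt quoted above.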

Models like Llama and Mistral expect only text in every prompt.

Building two separate models into the pipeline might lose context when the user mixes questions and images in a chat, so I am looking for a single LLM.
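To illustrate the context problem with two separate modules: if each turn is routed by modality, neither model sees the other's history, so a text follow-up about an earlier image reaches a model that never saw that image. A toy sketch (all names hypothetical; the real model calls are only marked by comments):

```python
from dataclasses import dataclass, field

@dataclass
class TwoModelPipeline:
    """Routes each chat turn to a text-only or an image+text model.

    Because the two models keep separate histories, a follow-up
    question about an earlier image goes to the text model with no
    knowledge of that image -- the context loss described above.
    """
    text_history: list = field(default_factory=list)
    vision_history: list = field(default_factory=list)

    def chat(self, text: str, image=None) -> str:
        if image is None:
            self.text_history.append(text)
            # a text-only LLM (e.g. Llama/Mistral) would be called here
            return f"[text model sees {len(self.text_history)} turn(s)]"
        self.vision_history.append((image, text))
        # an image+text model (e.g. LLaVA/Fuyu) would be called here
        return f"[vision model sees {len(self.vision_history)} turn(s)]"

pipe = TwoModelPipeline()
pipe.chat("Describe this", image="cat.png")    # routed to the vision model
reply = pipe.chat("What colour was the cat?")  # text model: image context lost
```

A single multimodal LLM with one shared history avoids this split entirely, which is what I am after.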

In my exploration I found NExT-GPT, which can chat using both image + text in some turns and text only in others. That is exactly the kind of single LLM I want to use, but I couldn’t find its integration in HF.

Any help from the forum would be much appreciated. Tagging @sgugger @nielsr @sayakpaul for a quick response.

Thank you so much in advance


One such model is CogVLM, which can be used both as a standalone (text-only) large language model and with images as additional input.

Thank you @nielsr. I tried and failed to load CogVLM in Colab, so I loaded Qwen-7B in Colab instead and used it in my application.

Thank you