Multimodal LLM with Image and Text sequentially in its prompt

purnasai · December 10, 2023, 6:07am

I am trying to find a single LLM that can accept both images and text in the prompt.

Models like LLaVA-13B and Fuyu-8B expect image + text in every prompt (check here). The prompt looks like this:

“<image>\nUSER: What’s the content of the image?\nASSISTANT:”
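To make the template above concrete, here is a minimal sketch in plain Python. `build_prompt` and its `has_image` flag are hypothetical names for illustration, not part of any library, and writing the placeholder as `<image>` without spaces is an assumption. Whether a given checkpoint tolerates a prompt without the placeholder depends on the model.

```python
def build_prompt(question: str, has_image: bool = True) -> str:
    """Format one user turn in the template quoted above (hypothetical helper)."""
    # The placeholder marks where the image features would go; it is omitted
    # for a text-only turn, which not every model may accept.
    image_part = "<image>\n" if has_image else ""
    return f"{image_part}USER: {question}\nASSISTANT:"


print(build_prompt("What's the content of the image?"))
print(build_prompt("Tell me a joke.", has_image=False))
```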

Models like Llama and Mistral expect only text in every prompt.

Building two separate modules into the pipeline (one for image + text, one for text only) might lose context when the user mixes questions and images in a chat. So I am looking for a single LLM.
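The context concern can be pictured with one shared conversation history in which each turn optionally carries an image, so a text-only follow-up is still rendered after the earlier image turn. This is only a sketch: `Turn` and `render` are hypothetical names, and the `USER:`/`ASSISTANT:` layout is assumed from the LLaVA-style template quoted in this post, not a confirmed format for any particular model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Turn:
    question: str
    answer: Optional[str] = None      # None for the turn still awaiting a reply
    image_path: Optional[str] = None  # None for a text-only turn


def render(history: List[Turn]) -> str:
    """Flatten the whole history into one prompt string."""
    parts = []
    for turn in history:
        # Only turns that carry an image get the placeholder.
        prefix = "<image>\n" if turn.image_path else ""
        text = f"{prefix}USER: {turn.question}\nASSISTANT:"
        if turn.answer is not None:
            text += f" {turn.answer}\n"
        parts.append(text)
    return "".join(parts)


history = [
    Turn("What is this?", answer="A cat.", image_path="cat.png"),
    Turn("What colour is it?"),  # text-only follow-up about the same image
]
print(render(history))
```

Because both turns live in one history, the second question is sent to the same model together with the first turn, instead of being routed to a separate text-only module that never saw the image.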

In my exploration I found NExT-GPT, which can chat using an image in some turns and text only in others. That is what I want from a single LLM, but I couldn’t find its integration in HF.

Any help from the forum would be much appreciated. Tagging @sgugger @nielsr @sayakpaul for a quick response.

Thank you so much in advance

nielsr · December 10, 2023, 5:35pm

Hi,

One such model is CogVLM, which can be used both as a standalone large language model (text-only) and with images as additional input.

purnasai · January 1, 2024, 11:36am

Thank you @nielsr. I tried and failed to load CogVLM in Colab, so I used Qwen-7b instead, which loaded in Colab, and I used it in my application.

Thank you
