How to get Visual/Text/Multimodal Embeddings from the LLaVA Model

Hey there,
I'm currently using the LLaVA model and I want to get visual/text/multimodal embeddings from it.

import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).to(0)

processor = AutoProcessor.from_pretrained(model_id)

At the very least, I'd like the final multimodal embedding from the model.
I want these embeddings to build a text/image search engine on top of this model.

Can anyone please advise me on this scenario? :grinning:

Thank you

Hi,

You can pass output_hidden_states=True to the forward method. In that case, the model output will include a hidden_states key containing the embeddings of the tokens from every layer, including the final one.
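Here's a minimal sketch of what that could look like. The prompt, image URL, and mean-pooling step below are just illustrative assumptions on my part, not an official recipe:

import torch
import requests
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).to(0)
processor = AutoProcessor.from_pretrained(model_id)

# Placeholder prompt and image
prompt = "USER: <image>\nDescribe this product. ASSISTANT:"
image = Image.open(requests.get("https://example.com/product.jpg", stream=True).raw)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(0, torch.float16)

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states is a tuple with one tensor per layer; the last entry has shape
# (batch_size, sequence_length, hidden_size) and covers both image and text tokens.
last_hidden = outputs.hidden_states[-1]

# One simple way to get a single multimodal vector: mean-pool over the sequence.
multimodal_embedding = last_hidden.mean(dim=1)  # (batch_size, hidden_size)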

However, note that the embeddings of a model like LLaVA aren't optimal for retrieval systems, since that's not what the model was trained for. For that I'd recommend multimodal models such as CLIP and SigLIP. There's also the recent ColPali work, which obtains substantial performance improvements (it extends the ColBERT model to multimodal use cases).
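For instance, here's a rough sketch of getting retrieval-style embeddings with CLIP (the checkpoint, image path, and query text are placeholders I picked for illustration):

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("product.jpg")  # placeholder image path
query = "a red running shoe"       # placeholder text query

image_inputs = clip_processor(images=image, return_tensors="pt")
text_inputs = clip_processor(text=[query], return_tensors="pt", padding=True)

with torch.no_grad():
    image_emb = clip_model.get_image_features(**image_inputs)
    text_emb = clip_model.get_text_features(**text_inputs)

# L2-normalize and compare with cosine similarity, e.g. for indexing in a vector database.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
similarity = (image_emb @ text_emb.T).item()

The usual pattern would be to index the image embeddings in a vector database and embed user queries with the text encoder at search time.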

Hey @nielsr,
Thanks for the information, that's very useful :grinning:
I want to build a conversational multimodal product search, which is why I chose the LLaVA model: it's more conversational compared to other models.

I took this Qdrant multimodal search implementation as a reference:
https://docs.llamaindex.ai/en/latest/examples/multi_modal/ollama_cookbook/

Can you please advise me if you have any ideas on conversational multimodal search? :grinning: