How to get Visual/Text/Multimodal Embeddings from the LLaVA Model

Hey there,
I'm currently using the LLaVA model and I want to get visual/text/multimodal embeddings from it.

import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).to(0)

processor = AutoProcessor.from_pretrained(model_id)

At the very least, I'd like the final multimodal embedding from the model.
I want these embeddings to build a text/image search engine on top of this model.

Can anyone please advise me on this scenario? :grinning:

Thank you

Hi,

You can pass output_hidden_states=True to the forward method. In that case, the model output will include a hidden_states key containing the embeddings of the tokens from every layer, including the final one.
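Here's a minimal sketch of what that could look like. The prompt, image URL, and mean-pooling step below are just illustrative assumptions on my part, not an official recipe:

import torch
import requests
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).to(0)
processor = AutoProcessor.from_pretrained(model_id)

# Placeholder prompt and image
prompt = "USER: <image>\nDescribe this product. ASSISTANT:"
image = Image.open(requests.get("https://example.com/product.jpg", stream=True).raw)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(0, torch.float16)

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states is a tuple with one tensor per layer; the last entry has shape
# (batch_size, sequence_length, hidden_size) and covers both image and text tokens.
last_hidden = outputs.hidden_states[-1]

# One simple way to get a single multimodal vector: mean-pool over the sequence.
multimodal_embedding = last_hidden.mean(dim=1)  # (batch_size, hidden_size)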

However, note that the embeddings of a model like LLaVA aren't optimal for retrieval systems, since that's not what the model was trained for. For that I'd recommend multimodal models such as CLIP and SigLIP. There's also the recent ColPali work, which obtains substantial performance improvements (it extends the ColBERT model to multimodal use cases).
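For instance, here's a rough sketch of getting retrieval-style embeddings with CLIP (the checkpoint, image path, and query text are placeholders I picked for illustration):

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("product.jpg")  # placeholder image path
query = "a red running shoe"       # placeholder text query

image_inputs = clip_processor(images=image, return_tensors="pt")
text_inputs = clip_processor(text=[query], return_tensors="pt", padding=True)

with torch.no_grad():
    image_emb = clip_model.get_image_features(**image_inputs)
    text_emb = clip_model.get_text_features(**text_inputs)

# L2-normalize and compare with cosine similarity, e.g. for indexing in a vector database.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
similarity = (image_emb @ text_emb.T).item()

The usual pattern would be to index the image embeddings in a vector database and embed user queries with the text encoder at search time.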

Hey @nielsr,
Thanks for the information, that's very useful :grinning:
I want to build a conversational multimodal product search, which is why I chose the LLaVA model: it's more conversational compared to other models.

I took this Qdrant multimodal search implementation as a reference:
https://docs.llamaindex.ai/en/latest/examples/multi_modal/ollama_cookbook/

Can you please advise me if you have any ideas on conversational multimodal search? :grinning: