You can pass output_hidden_states=True to the forward method. In that case, the model output will include a hidden_states key; its last element contains the final embeddings of the tokens.
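For example, here's a minimal sketch of how that could look, assuming the llava-hf/llava-1.5-7b-hf checkpoint and a local product.jpg image as placeholders:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("product.jpg")
prompt = "USER: <image>\nDescribe this product. ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states is a tuple with one tensor per layer; the last element
# holds the final token embeddings of shape (batch_size, seq_len, hidden_size)
final_embeddings = outputs.hidden_states[-1]
```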
However, note that the embeddings of a model like LLaVa aren't optimal for retrieval systems, since retrieval is not what the model was trained for. For that I'd recommend multimodal embedding models such as CLIP and SigLIP. There's also the recent ColPali work, which obtains substantial performance improvements by extending the ColBERT approach to multimodal use cases.
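For instance, here's a minimal sketch of retrieval-style embeddings with CLIP, assuming the openai/clip-vit-base-patch32 checkpoint and placeholder inputs:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("product.jpg")
inputs = processor(text=["a red sneaker"], images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# Normalize the projected embeddings so they live in a shared space
# suitable for cosine-similarity retrieval
image_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
text_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
similarity = image_emb @ text_emb.T
```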
Hey @nielsr
Thanks for this information, it's very useful.
I want to build a conversational multimodal product search. That's why I chose the LLaVa model, which is more conversational compared to other models.