How to produce a correct embedding from a multimodal (vision-language) model for a dataset?

Hi, I've been looking for a solution in various papers, but I haven't found the answer anywhere. I have a CSV dataset, and I would like to produce an embedding from a multimodal model for each row, passing a string and an array representing an image as the inputs to the model. Given the current state of the art, what is the best way to produce the embedding for each row? Do I need to extract it from the model by passing the whole dataset as input at once, or is it better to feed a single row of the dataset at a time?
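To make the question concrete, here is a minimal sketch of what I have in mind. The encoder below is a placeholder for a real vision-language model (the function name `encode_batch`, the embedding size, and the batch size are all my own assumptions, not from any specific library); the point is the mini-batch loop, which sits between the two extremes I asked about (whole dataset at once vs. one row at a time):

```python
import numpy as np

EMB_DIM = 512  # assumed embedding size; a real model fixes this itself

def encode_batch(texts, images):
    # Placeholder for a real multimodal encoder call: it takes a list of
    # strings and a list of image arrays and returns one vector per pair.
    # Here we just emit deterministic dummy vectors of the right shape.
    return np.stack(
        [np.full(EMB_DIM, len(t) + img.mean()) for t, img in zip(texts, images)]
    )

def embed_dataset(rows, batch_size=32):
    """Embed every (text, image) row in mini-batches: fewer model calls
    than one row at a time, and bounded memory unlike the whole dataset
    in a single forward pass."""
    chunks = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        texts = [t for t, _ in batch]
        images = [img for _, img in batch]
        chunks.append(encode_batch(texts, images))
    return np.concatenate(chunks, axis=0)

# Toy stand-in for the CSV: 5 rows of (caption, 8x8 grayscale image array)
rows = [(f"caption {i}", np.ones((8, 8)) * i) for i in range(5)]
emb = embed_dataset(rows, batch_size=2)
print(emb.shape)  # (5, 512)
```

Is this mini-batch pattern the right approach with current models, or does batching change the embeddings themselves in some way I should worry about?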
