How to produce a correct embedding from a multimodal (vision-language) model for a dataset?

Hi, I've been looking for a solution in various papers, but I haven't found the answer anywhere. I have a CSV dataset, and I would like to produce an embedding from a multimodal model for each row, passing a string and an array representing an image as the inputs to the model. Given the current state of the art, what is the best way to produce the embedding for each row? Do I need to extract it from the model by passing the whole dataset as input at once, or is it better to feed a single row of the dataset at a time?
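To make the question concrete, here is a minimal sketch of what I have in mind. The encoder below is a placeholder for a real vision-language model (the function name `encode_batch`, the embedding size, and the batch size are all my own assumptions, not from any specific library); the point is the mini-batch loop, which sits between the two extremes I asked about (whole dataset at once vs. one row at a time):

```python
import numpy as np

EMB_DIM = 512  # assumed embedding size; a real model fixes this itself

def encode_batch(texts, images):
    # Placeholder for a real multimodal encoder call: it takes a list of
    # strings and a list of image arrays and returns one vector per pair.
    # Here we just emit deterministic dummy vectors of the right shape.
    return np.stack(
        [np.full(EMB_DIM, len(t) + img.mean()) for t, img in zip(texts, images)]
    )

def embed_dataset(rows, batch_size=32):
    """Embed every (text, image) row in mini-batches: fewer model calls
    than one row at a time, and bounded memory unlike the whole dataset
    in a single forward pass."""
    chunks = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        texts = [t for t, _ in batch]
        images = [img for _, img in batch]
        chunks.append(encode_batch(texts, images))
    return np.concatenate(chunks, axis=0)

# Toy stand-in for the CSV: 5 rows of (caption, 8x8 grayscale image array)
rows = [(f"caption {i}", np.ones((8, 8)) * i) for i in range(5)]
emb = embed_dataset(rows, batch_size=2)
print(emb.shape)  # (5, 512)
```

Is this mini-batch pattern the right approach with current models, or does batching change the embeddings themselves in some way I should worry about?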
