What would be the best image-to-text model for a lot of images?

Jorvan · November 8, 2023, 7:44pm

I have more than 1,000,000 images that I need to describe with text (75 words/tokens or fewer per image).

I’ve tried CLIP and BLIP, but I find them fairly slow, and they often yield unsatisfying results. I also wanted to experiment with BLIP-2, but I don’t have the hardware to run it (I guess I could pay for cloud computing, but I don’t know whether it’s worth it or how fast it would be). I also searched for other alternatives, but none seemed promising enough, and I have no one else to ask for advice.
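To put the speed concern in numbers, here is a rough back-of-the-envelope sketch of how long 1,000,000 images would take at a few throughputs. The function name `estimated_hours` and the rates are made up for illustration; they are not measurements of CLIP, BLIP, BLIP-2, or any particular GPU.

```python
def estimated_hours(num_images: int, images_per_second: float) -> float:
    """Wall-clock hours needed to caption num_images at a steady throughput."""
    if images_per_second <= 0:
        raise ValueError("images_per_second must be positive")
    return num_images / images_per_second / 3600


NUM_IMAGES = 1_000_000

# Hypothetical throughputs (images per second), NOT benchmarks of any model.
for rate in (1, 10, 50):
    print(f"{rate:>3} img/s -> {estimated_hours(NUM_IMAGES, rate):.1f} h")
```

Under these assumed rates, 1 image per second means about 278 hours (over 11 days), while 10 images per second brings it down to about 28 hours, so the real throughput of whichever model is used matters a lot at this scale.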

What do you think would be a good solution to this problem, considering that I mostly care about speed but also a bit about the quality of the descriptions?
