What would be the best image-to-text model for a lot of images?

I have more than 1,000,000 images that I need to describe in text (75 words/tokens or fewer per image).

I’ve tried using CLIP and BLIP, but I find them fairly slow, and they often yield unsatisfying results. I also wanted to experiment with BLIP-2, but I don’t have the hardware to run it (I guess I could pay for cloud computing, but I don’t know whether it would be worth it or how fast it would be). I’ve also searched for other alternatives, but none seemed promising enough, and I have no one else to ask for advice.
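To make the "is it worth it" question concrete, here is a rough back-of-the-envelope sketch of how I would estimate total time and cost for the whole set. The function name and the throughput and price values are placeholders I made up for illustration, not measurements of any model or cloud provider.

```python
def estimate_captioning_job(num_images, images_per_second, usd_per_gpu_hour):
    """Return (gpu_hours, cost_usd) for captioning num_images at a steady throughput."""
    if images_per_second <= 0:
        raise ValueError("images_per_second must be positive")
    gpu_hours = num_images / images_per_second / 3600  # seconds -> hours
    return gpu_hours, gpu_hours * usd_per_gpu_hour


# Placeholder inputs (assumptions, not benchmarks): replace them with the
# throughput measured on a small sample and the provider's actual hourly price.
hours, cost = estimate_captioning_job(
    num_images=1_000_000,
    images_per_second=10.0,
    usd_per_gpu_hour=1.0,
)
print(f"{hours:.1f} GPU-hours, about ${cost:.2f}")
```

Timing a few hundred images on whatever hardware is available and plugging the measured rate in here would show whether a given model is even feasible at this scale.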

What do you think could be a solution to this problem, considering that I mostly care about speed, but also somewhat about the quality of the descriptions? :thinking: