What is an efficient method to manually create image descriptions?

I want to add descriptions to a few thousand images and I’m looking for an efficient way to do this. Ideally I’d like something on Android where I see the image, speak the description, and have it transcribed to text and stored with the image in some way. Then I click next/OK, see the next image, and repeat.

Has anyone done something similar or have an idea of how they would do it?


Adding descriptions to a large number of images is usually done semi-automatically with a tool or a VLM like the following; doing it purely by hand is a rare use case…
I think your flow is achievable with an ASR model such as Whisper, but I haven’t seen such a finished product in Spaces, so I think the only way is to build one. If you want to find or create something similar, I can provide more information.
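
For reference, the ASR piece by itself is only a few lines with the transformers pipeline. A minimal sketch, assuming the openai/whisper-small checkpoint and a hypothetical recording.wav:

```python
# Minimal sketch of the ASR step, assuming the openai/whisper-small
# checkpoint; any Whisper size works with the same pipeline call.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

# "recording.wav" is a hypothetical file holding one spoken description.
result = asr("recording.wav")
print(result["text"])
```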

Thanks for the input, John. If I end up building something it seems like Whisper would be the best option for the ASR portion.


If you are going to use Whisper, the following implementation seems fast and good, although it requires a GPU.
The flow I have in mind:

1. Put the 1000 image files in a private dataset repo on HF.
2. Display one of them in the GUI, skipping any image that already has a matching .txt file, since those have already been processed.
3. Accept voice input, transcribe it with Whisper, and put the result in a text box.
4. Optionally improve the contents of the text box with a suitable grammar checker.
5. When the Submit button is pressed, save a .txt file to the dataset repo with the same name as the image file but a different extension, then display the next image.

I think you can build something like this from common existing components; see the sketch below. It would also be nice to put a suitable VLM or tagger in front of Whisper to pre-fill the input.
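
To make that concrete, here is a rough sketch of such an app in Gradio (4.x API assumed), with a hypothetical private dataset repo your-username/image-captions and the openai/whisper-small checkpoint; the grammar checker and the VLM/tagger pre-fill are left out for brevity.

```python
# Rough sketch of the caption-dictation loop. REPO_ID is a hypothetical
# private dataset repo; swap in your own before running.
import io

import gradio as gr
from huggingface_hub import HfApi, hf_hub_download
from transformers import pipeline

REPO_ID = "your-username/image-captions"  # hypothetical dataset repo
api = HfApi()
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")


def next_image():
    """Return (local path, repo filename) of the first image with no .txt yet."""
    files = set(api.list_repo_files(REPO_ID, repo_type="dataset"))
    for f in sorted(files):
        stem, _, ext = f.rpartition(".")
        if ext.lower() in ("jpg", "jpeg", "png") and f"{stem}.txt" not in files:
            return hf_hub_download(REPO_ID, f, repo_type="dataset"), f
    return None, None  # everything has been captioned


def transcribe(audio_path):
    """Turn the recorded description into text for the textbox."""
    return asr(audio_path)["text"] if audio_path else ""


def submit(description, remote_name):
    """Save the description as <image name>.txt, then load the next image."""
    if remote_name:  # skip the upload if no image is currently shown
        stem, _, _ = remote_name.rpartition(".")
        api.upload_file(
            path_or_fileobj=io.BytesIO(description.encode("utf-8")),
            path_in_repo=f"{stem}.txt",
            repo_id=REPO_ID,
            repo_type="dataset",
        )
    path, name = next_image()
    return path, name, ""


with gr.Blocks() as demo:
    image = gr.Image(type="filepath")
    remote_name = gr.State()
    audio = gr.Audio(sources=["microphone"], type="filepath")
    text = gr.Textbox(label="Description")
    button = gr.Button("Submit")

    # Show the first unprocessed image on page load.
    demo.load(next_image, outputs=[image, remote_name])
    # Transcribe each new recording into the textbox for editing.
    audio.change(transcribe, inputs=audio, outputs=text)
    # Save the caption and advance to the next image.
    button.click(submit, inputs=[text, remote_name], outputs=[image, remote_name, text])

demo.launch()
```

Because each .txt file shares its image’s name, the “skip already processed” check falls out of a single repo listing, so the app can be stopped and resumed at any point.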