What is an efficient method to manually create image descriptions?

I want to add descriptions to a few thousand images and I'm looking for an efficient way to do this. Ideally I'd like something on Android where I see the image, speak the description, and have it transcribed to text and stored in some way with the image. Then I click next/OK, see the next image, and repeat.

Has anyone done something similar or have an idea of how they would do it?


Adding descriptions to a large number of images is usually done semi-automatically with a tool or a VLM, so doing it purely manually is a fairly rare use case…
I think your flow could be achieved with an ASR model such as Whisper, but I haven't seen a finished product like that in Spaces, so the only way is probably to build one yourself. If you want to find or build something similar, I can provide you with information.
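
For example, the ASR step on its own could be a minimal sketch like this, using the transformers pipeline (the checkpoint name is just one possible choice):

```python
# Minimal sketch of the ASR step: transcribe one spoken description.
# "openai/whisper-small" is only an example checkpoint; any Whisper size works.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

result = asr("recording.wav")  # path to the recorded spoken description
print(result["text"])          # the transcribed description text
```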

Thanks for the input, John. If I end up building something, it seems like Whisper would be the best option for the ASR portion.


If you are going to use Whisper, the following one seems to be fast and good, although it requires a GPU.
The flow of the program I have in mind: put the 1000 image files in a private dataset repo on HF, display one of them in the GUI, accept voice input via Whisper and put it in a text box, and optionally improve the contents of the text box with an appropriate grammar checker. When the Submit button is pressed, a .txt file with the same name as the image file (only the extension differs) is saved to the dataset repo, and the next image is displayed. Images for which a .txt is found are not displayed, because they have already been processed.
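
A rough sketch of that loop, assuming Gradio for the GUI and huggingface_hub for the repo access (the repo id, file extensions, and checkpoint below are all placeholders):

```python
# Rough sketch: show the next undescribed image, transcribe a spoken
# description with Whisper, save it as a .txt in the dataset repo, repeat.
import gradio as gr
from huggingface_hub import HfApi, hf_hub_download
from transformers import pipeline

REPO_ID = "your-username/images-to-describe"  # hypothetical private dataset repo
IMAGE_EXTS = {"jpg", "jpeg", "png"}

api = HfApi()
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

def next_image():
    """Return (local path, repo path) of the first image with no .txt yet."""
    files = set(api.list_repo_files(REPO_ID, repo_type="dataset"))
    for f in sorted(files):
        stem, _, ext = f.rpartition(".")
        if ext.lower() in IMAGE_EXTS and f"{stem}.txt" not in files:
            return hf_hub_download(REPO_ID, f, repo_type="dataset"), f
    return None, None  # everything has been processed

def transcribe(audio_path):
    """Fill the text box from the recorded audio."""
    return asr(audio_path)["text"] if audio_path else ""

def submit(description, repo_path):
    """Save the description as <image name>.txt, then advance."""
    txt_name = repo_path.rsplit(".", 1)[0] + ".txt"
    api.upload_file(
        path_or_fileobj=description.encode("utf-8"),
        path_in_repo=txt_name,
        repo_id=REPO_ID,
        repo_type="dataset",
    )
    local, nxt = next_image()
    return local, nxt, ""  # next image, its repo path, cleared text box

with gr.Blocks() as demo:
    current = gr.State()  # repo path of the image currently on screen
    image = gr.Image(type="filepath")
    mic = gr.Audio(sources=["microphone"], type="filepath")
    text = gr.Textbox(label="Description")
    ok = gr.Button("Submit")

    demo.load(next_image, outputs=[image, current])
    mic.stop_recording(transcribe, inputs=mic, outputs=text)
    ok.click(submit, inputs=[text, current], outputs=[image, current, text])

demo.launch()
```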
I think you can build something like this using only common, existing components.
It would be nice to put an appropriate VLM or tagger in front of Whisper to aid input.
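
For instance, the text box could start from a draft caption that the voice input then corrects (BLIP here is just one possible captioner):

```python
# Pre-fill the text box with a draft caption before the spoken correction.
# The BLIP checkpoint is only one example of an image-to-text model.
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

draft = captioner("image.jpg")[0]["generated_text"]
print(draft)  # a one-line draft caption to correct by voice
```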
