Hi, I’m trying to fine-tune CLIP (openai/clip-vit-base-patch32) on flower images using a generic caption of the form “A close-up photo of a [Species name] flower observed by an amateur naturalist”. I have something working, but the loss doesn’t seem to decrease over the epochs. I have ~6,000 image/text pairs covering many, many different species of flowers.
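For reference, my training loop looks roughly like this (a simplified sketch; `train_pairs`, the batch size, and the learning rate are placeholders, not my exact setup):

```python
# Simplified sketch of the fine-tuning loop; `train_pairs` is a placeholder for
# the real data: a list of (image_path, species_name) tuples.
import torch
from PIL import Image
from torch.utils.data import DataLoader
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)  # small LR for fine-tuning

train_pairs = []  # placeholder, e.g. [("poppy_001.jpg", "California poppy"), ...]

def collate(batch):
    # Build the generic caption for each species and preprocess images + text together
    texts = [f"A close-up photo of a {species} flower observed by an amateur naturalist"
             for _, species in batch]
    images = [Image.open(path).convert("RGB") for path, _ in batch]
    return processor(text=texts, images=images, return_tensors="pt",
                     padding=True, truncation=True)

loader = DataLoader(train_pairs, batch_size=64, shuffle=True, collate_fn=collate)

model.train()
for epoch in range(5):
    for inputs in loader:
        inputs = {k: v.to(device) for k, v in inputs.items()}
        out = model(**inputs, return_loss=True)  # CLIP's built-in contrastive loss
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```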
My questions:
How diverse should my captions be? Should/could I have multiple captions per image?
My next thought would be to run the images through some model (GPT-4, or a less expensive one) to generate captions and then insert the flower species name into the text somewhere.
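Roughly what I have in mind, sketched with BLIP as a stand-in caption model (the model choice and the way the species name is spliced in are just assumptions, not something I’ve tested):

```python
# Sketch: generate a caption with an image-captioning model, then work in the species name.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

blip_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def caption_with_species(image_path: str, species: str) -> str:
    image = Image.open(image_path).convert("RGB")
    inputs = blip_processor(images=image, return_tensors="pt")
    out = captioner.generate(**inputs, max_new_tokens=30)
    generated = blip_processor.decode(out[0], skip_special_tokens=True)
    # Naively append the species name to the generated caption
    return f"{generated}, a {species} flower"
```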
Ideas? General guidelines for fine-tuning.
Thanks!
Here are some ideas and general guidelines for fine-tuning CLIP on flower images with your questions in mind:
Caption Diversity and Multiple Captions:
- Diversity is good: Having a diverse set of captions will help your model generalize better to unseen flower images and phrases. Aim for a variety of descriptive words and sentence structures.
- Multiple captions per image (optional, but can help): Using multiple captions per image gives the model more information and context, which is especially useful when a flower species has several common names or descriptions. Be mindful of redundancy and make sure each caption adds something new; a simple way to do this is to sample from a small pool of templates, as in the sketch below.
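A minimal sketch of that idea, with hypothetical template strings (swap in whatever phrasings fit your data):

```python
# Sample one of several caption templates per draw, so the same image
# is paired with varied text across epochs.
import random

CAPTION_TEMPLATES = [
    "A close-up photo of a {species} flower observed by an amateur naturalist",
    "A photo of a {species} flower in bloom",
    "A {species} flower growing in the wild",
    "Macro shot of the petals of a {species} flower",
]

def make_caption(species: str) -> str:
    """Pick a random template so repeated epochs see different captions."""
    return random.choice(CAPTION_TEMPLATES).format(species=species)

# Each call may return a different caption for the same image
print(make_caption("California poppy"))
```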