Hi, I’m trying to fine-tune CLIP (openai/clip-vit-base-patch32) on flower images using a generic caption of the form “A close-up photo of a [Species name] flower observed by an amateur naturalist”. I have something working, but the loss doesn’t seem to decrease over the epochs. I have ~6,000 image/text pairs covering many, many different species of flowers.
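For reference, my training loop looks roughly like this (a simplified sketch; `train_pairs`, the batch size, and the learning rate are placeholders, not my exact setup):

```python
# Simplified sketch of the fine-tuning loop; `train_pairs` is a placeholder for
# the real data: a list of (image_path, species_name) tuples.
import torch
from PIL import Image
from torch.utils.data import DataLoader
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)  # small LR for fine-tuning

train_pairs = []  # placeholder, e.g. [("poppy_001.jpg", "California poppy"), ...]

def collate(batch):
    # Build the generic caption for each species and preprocess images + text together
    texts = [f"A close-up photo of a {species} flower observed by an amateur naturalist"
             for _, species in batch]
    images = [Image.open(path).convert("RGB") for path, _ in batch]
    return processor(text=texts, images=images, return_tensors="pt",
                     padding=True, truncation=True)

loader = DataLoader(train_pairs, batch_size=64, shuffle=True, collate_fn=collate)

model.train()
for epoch in range(5):
    for inputs in loader:
        inputs = {k: v.to(device) for k, v in inputs.items()}
        out = model(**inputs, return_loss=True)  # CLIP's built-in contrastive loss
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```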
My questions:
How diverse should my captions be? Should/could I have multiple captions per image?
My next thought would be to run the images through some model (GPT-4, or a less expensive one) to generate captions and then insert the flower species name into the text somewhere.
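Roughly what I have in mind, sketched with BLIP as a stand-in caption model (the model choice and the way the species name is spliced in are just assumptions, not something I’ve tested):

```python
# Sketch: generate a caption with an image-captioning model, then work in the species name.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

blip_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def caption_with_species(image_path: str, species: str) -> str:
    image = Image.open(image_path).convert("RGB")
    inputs = blip_processor(images=image, return_tensors="pt")
    out = captioner.generate(**inputs, max_new_tokens=30)
    generated = blip_processor.decode(out[0], skip_special_tokens=True)
    # Naively append the species name to the generated caption
    return f"{generated}, a {species} flower"
```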
Ideas? General guidelines for fine-tuning.
Thanks!
Here are some ideas and general guidelines for fine-tuning CLIP on flower images with your questions in mind:
Caption Diversity and Multiple Captions:
- Diversity is good: Having a diverse set of captions will help your model generalize better to unseen flower images and phrases. Aim for a variety of descriptive words and sentence structures.
- Multiple captions per image (optional, but can help): Using multiple captions per image gives the model more information and context, which is especially useful when a flower species has several common names or descriptions. Be mindful of redundancy and make sure each caption adds something new; a simple way to do this is to sample from a small pool of templates, as in the sketch below.
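A minimal sketch of that idea, with hypothetical template strings (swap in whatever phrasings fit your data):

```python
# Sample one of several caption templates per draw, so the same image
# is paired with varied text across epochs.
import random

CAPTION_TEMPLATES = [
    "A close-up photo of a {species} flower observed by an amateur naturalist",
    "A photo of a {species} flower in bloom",
    "A {species} flower growing in the wild",
    "Macro shot of the petals of a {species} flower",
]

def make_caption(species: str) -> str:
    """Pick a random template so repeated epochs see different captions."""
    return random.choice(CAPTION_TEMPLATES).format(species=species)

# Each call may return a different caption for the same image
print(make_caption("California poppy"))
```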