Duplicate inputs in contrastive losses (e.g. CLIP)

I’m working on fine-tuning CLIP, but I think my question applies generally to models trained with contrastive losses: what happens when duplicate labels/captions appear in the training set?

I have a dataset meant for benchmarking few-shot image classifiers that I want to use for fine-tuning CLIP. It has 5-100 examples of each of its >1k classes. If a batch happens to draw two examples of the same class, then two of the captions are identical and get exactly the same score in every image’s row of the similarity matrix. For a given image of that class, one of those scores sits on the diagonal (the positive) and the other sits off-diagonal (a supposed negative). Because the two scores are yoked together, the loss for that row can never reach zero; the best the optimizer can do at the end of training is drive down all the other scores in the row. Ideally we would be able to ignore the repeated label/class when computing the loss.
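Here’s a minimal PyTorch sketch of what I mean: a CLIP-style symmetric cross-entropy where off-diagonal entries that share the positive’s class are masked out of the softmax instead of being treated as negatives. The function name `clip_loss_with_duplicate_mask` and the `labels` argument are my own for illustration, not from any library:

```python
import torch
import torch.nn.functional as F

def clip_loss_with_duplicate_mask(image_emb, text_emb, labels, temperature=0.07):
    """Symmetric CLIP-style cross-entropy over a batch, except that
    off-diagonal pairs sharing the positive's class are excluded from
    the softmax instead of being pushed down as negatives.

    image_emb, text_emb: (B, D) unnormalized embeddings
    labels:              (B,)   class id of each example
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature          # (B, B)

    # False negatives: same class as the positive, but off the diagonal.
    same_class = labels.unsqueeze(0) == labels.unsqueeze(1)  # (B, B)
    eye = torch.eye(len(labels), dtype=torch.bool, device=logits.device)
    mask = same_class & ~eye

    # A -inf logit contributes zero probability, so duplicates are ignored.
    logits = logits.masked_fill(mask, float('-inf'))

    targets = torch.arange(len(labels), device=logits.device)
    # The mask is symmetric, so logits.t() is masked consistently too.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```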

Not only does this distort the loss a little, it also distorts the top-k accuracies. If image A gets confused for label B, and label B is repeated 3 times in the batch, then B’s duplicates fill all three top-3 slots and push the correct label out, so we record a top-3 miss when, counted over distinct labels, it should have been a top-3 hit.
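One duplicate-aware way to score this is to max-pool each image’s scores over captions of the same class before taking top-k, so a repeated label occupies one slot instead of several. A sketch under the same assumed `labels` tensor as above (the function name is hypothetical):

```python
import torch

def topk_accuracy_unique_labels(logits, labels, k=3):
    """Top-k accuracy over *distinct* labels: each class keeps only its
    best-scoring caption, so a duplicated wrong label occupies one
    top-k slot instead of several.

    logits: (B, B) image-to-text similarity matrix for the batch
    labels: (B,)   class id of each caption (column)
    """
    uniq, inv = labels.unique(return_inverse=True)  # inv[j] = class index of caption j
    # Max-pool each image's scores over duplicate captions of a class.
    pooled = torch.stack(
        [logits[:, inv == u].amax(dim=1) for u in range(len(uniq))], dim=1
    )                                               # (B, num_unique_classes)
    topk = pooled.topk(min(k, len(uniq)), dim=1).indices
    # Image i's correct class index is that of its own caption.
    hits = (topk == inv.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()
```

With this counting, a batch where label B appears three times wastes only one of the k slots on B.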

Surely the original dataset used to train CLIP, and other huge image-caption datasets such as LAION, must contain repeated captions. What is done in practice? Is special care taken to make sure two images with the same caption never appear in the same batch? How would that even work when you’re training on batches of 10k+ images, as OpenAI did when training CLIP?