Duplicate inputs in contrastive losses (e.g. CLIP)

I’m working on fine-tuning CLIP, but I think my question applies generally to models trained with contrastive losses: what happens when duplicate labels/captions appear in the training set?

I have a dataset meant for benchmarking few-shot image classifiers that I want to use for fine-tuning CLIP. It has 5-100 examples of each of its >1k classes. If a batch happens to draw two examples of the same class, then two of the captions are identical and get exactly the same score in every image’s row of the similarity matrix. For a given image of that class, one of those scores sits on the diagonal (the positive) and the other sits off-diagonal (a supposed negative). Because the two scores are yoked together, the loss for that row can never reach zero; the best the optimizer can do at the end of training is drive down all the other scores in the row. Ideally we would be able to ignore the repeated label/class when computing the loss.
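Here’s a minimal PyTorch sketch of what I mean: a CLIP-style symmetric cross-entropy where off-diagonal entries that share the positive’s class are masked out of the softmax instead of being treated as negatives. The function name `clip_loss_with_duplicate_mask` and the `labels` argument are my own for illustration, not from any library:

```python
import torch
import torch.nn.functional as F

def clip_loss_with_duplicate_mask(image_emb, text_emb, labels, temperature=0.07):
    """Symmetric CLIP-style cross-entropy over a batch, except that
    off-diagonal pairs sharing the positive's class are excluded from
    the softmax instead of being pushed down as negatives.

    image_emb, text_emb: (B, D) unnormalized embeddings
    labels:              (B,)   class id of each example
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature          # (B, B)

    # False negatives: same class as the positive, but off the diagonal.
    same_class = labels.unsqueeze(0) == labels.unsqueeze(1)  # (B, B)
    eye = torch.eye(len(labels), dtype=torch.bool, device=logits.device)
    mask = same_class & ~eye

    # A -inf logit contributes zero probability, so duplicates are ignored.
    logits = logits.masked_fill(mask, float('-inf'))

    targets = torch.arange(len(labels), device=logits.device)
    # The mask is symmetric, so logits.t() is masked consistently too.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```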

Not only does this distort the loss a little, it also distorts the top-k accuracies. If image A gets confused for label B, and label B is repeated 3 times in the batch, then B’s duplicates fill all three top-3 slots and push the correct label out, so we record a top-3 miss when, counted over distinct labels, it should have been a top-3 hit.
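One duplicate-aware way to score this is to max-pool each image’s scores over captions of the same class before taking top-k, so a repeated label occupies one slot instead of several. A sketch under the same assumed `labels` tensor as above (the function name is hypothetical):

```python
import torch

def topk_accuracy_unique_labels(logits, labels, k=3):
    """Top-k accuracy over *distinct* labels: each class keeps only its
    best-scoring caption, so a duplicated wrong label occupies one
    top-k slot instead of several.

    logits: (B, B) image-to-text similarity matrix for the batch
    labels: (B,)   class id of each caption (column)
    """
    uniq, inv = labels.unique(return_inverse=True)  # inv[j] = class index of caption j
    # Max-pool each image's scores over duplicate captions of a class.
    pooled = torch.stack(
        [logits[:, inv == u].amax(dim=1) for u in range(len(uniq))], dim=1
    )                                               # (B, num_unique_classes)
    topk = pooled.topk(min(k, len(uniq)), dim=1).indices
    # Image i's correct class index is that of its own caption.
    hits = (topk == inv.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()
```

With this counting, a batch where label B appears three times wastes only one of the k slots on B.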

Surely the original dataset used to train CLIP, and other huge image-caption datasets such as LAION, must contain repeated captions. What is done in practice? Is special care taken to make sure two images with the same caption never appear in the same batch? How would that even work when you’re training on batches of 10k+ images, as OpenAI did when training CLIP?