Fine-tuning a sentence-transformer for cosine similarity on 500k unlabeled sentence pairs: advice?


I have a fairly large dataset (~500k sentence pairs) that I’d like to use to fine-tune a similarity model with the sentence-transformers library, so that sentence A and sentence B, for this specific domain, score as more similar than they would with out-of-the-box encoders.

Problem is, they aren’t labeled, and I don’t have any counterexamples. These are pairs where sentence A corresponds to sentence B, almost in a question/answer way, and I want to create a custom transformer model that is better at assessing the similarity of two sentences from my domain, for use in a downstream task.

My question is: will this work with only unlabeled, positively correlated data? Like, should I just give every pair a score of 0.9 and run them through? I’m used to carefully balanced datasets, and this scares me, but I’ve never fine-tuned one of these before.
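For concreteness, here’s roughly what I was picturing. The sentences are made-up toy examples, and the sentence-transformers calls in the comments are just my guess at the usual CosineSimilarityLoss recipe, not something I’ve verified on my data:

```python
# Toy stand-ins for my real pairs -- the actual dataset is ~500k of these.
pairs = [
    ("how do I reset my password", "click 'forgot password' on the login page"),
    ("what is the refund window", "refunds are accepted within 30 days"),
]

# Every pair gets the same positive score, since I have no negatives.
train_examples = [(a, b, 0.9) for a, b in pairs]

# With sentence-transformers, I assume this would become something like:
#   from sentence_transformers import SentenceTransformer, InputExample, losses
#   from torch.utils.data import DataLoader
#   model = SentenceTransformer("all-MiniLM-L6-v2")
#   examples = [InputExample(texts=[a, b], label=s) for a, b, s in train_examples]
#   loader = DataLoader(examples, shuffle=True, batch_size=32)
#   model.fit(train_objectives=[(loader, losses.CosineSimilarityLoss(model))],
#             epochs=1)

print(train_examples[0])
```

Is that constant-label setup sensible, or does it just collapse into everything looking similar?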

Thank you so much.