Fine-tuning a sentence-transformer for cosine similarity on 500k unlabeled sentence pairs: advice?


I have a fairly large dataset (~500k sentence pairs) that I’d like to use to fine-tune a similarity model with the sentence-transformers library, so that sentence A and sentence B, for this specific domain, score as more similar than they would with out-of-the-box encoders.

Problem is, they aren’t labeled, and I don’t have any counterexamples. These are pairs where sentence A corresponds to sentence B, almost in a question/answer way, and I want to create a custom transformer model that is better at assessing the similarity of two sentences from my domain, for use in a downstream task.

My question is: will this work with only unlabeled, positively correlated data? Like, should I just give every pair a score of 0.9 and run them through? I’m used to carefully balanced datasets, and this scares me, but I’ve never fine-tuned one of these before.
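For concreteness, here’s roughly what I was picturing. The sentences are made-up toy examples, and the sentence-transformers calls in the comments are just my guess at the usual CosineSimilarityLoss recipe, not something I’ve verified on my data:

```python
# Toy stand-ins for my real pairs -- the actual dataset is ~500k of these.
pairs = [
    ("how do I reset my password", "click 'forgot password' on the login page"),
    ("what is the refund window", "refunds are accepted within 30 days"),
]

# Every pair gets the same positive score, since I have no negatives.
train_examples = [(a, b, 0.9) for a, b in pairs]

# With sentence-transformers, I assume this would become something like:
#   from sentence_transformers import SentenceTransformer, InputExample, losses
#   from torch.utils.data import DataLoader
#   model = SentenceTransformer("all-MiniLM-L6-v2")
#   examples = [InputExample(texts=[a, b], label=s) for a, b, s in train_examples]
#   loader = DataLoader(examples, shuffle=True, batch_size=32)
#   model.fit(train_objectives=[(loader, losses.CosineSimilarityLoss(model))],
#             epochs=1)

print(train_examples[0])
```

Is that constant-label setup sensible, or does it just collapse into everything looking similar?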

Thank you so much.