I have a dataset of sentences. My task is unsupervised, so I cannot fine-tune a pre-trained model. I need sentence embeddings that reflect the semantic meaning of each sentence.

So far I have tried BERT, using the [CLS] token's output as the sentence embedding. Unfortunately, when computing the cosine similarity between embeddings, I found that all the sentences are highly similar to each other (cosine similarity always around 0.8). This cannot be right, since the sentences in my dataset have very different semantic meanings. Reading around, I found that there is no general consensus on using [CLS] as a sentence embedding, and some papers show that pre-trained BERT without fine-tuning produces poor sentence embeddings.

What is the best approach in this case? How should I embed the sentences in my dataset, given that I cannot fine-tune a pre-trained model? Thanks in advance!
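For reference, this is roughly what I am doing to get the [CLS] embeddings (a minimal sketch with the Hugging Face `transformers` library; `bert-base-uncased` and the two sentences are just placeholders, not my actual data):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

# Two example sentences with clearly different meanings
sentences = ["The cat sat on the mat.", "Quarterly revenue grew by twelve percent."]

with torch.no_grad():
    batch = tokenizer(sentences, padding=True, return_tensors="pt")
    outputs = model(**batch)
    # [CLS] is the first token of the last hidden state
    cls_embeddings = outputs.last_hidden_state[:, 0, :]

sim = torch.nn.functional.cosine_similarity(
    cls_embeddings[0], cls_embeddings[1], dim=0
)
print(sim.item())  # in my runs this comes out around 0.8, even for unrelated sentences
```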