Hey, I’ve been trying to train Sentence2Vec embeddings and I’ve been wondering what you think about my approach.
I’d be glad to learn about possible pitfalls in my approach and how to avoid them.
What do I have?
- I have a small corpus of about 4 million unique sentences in my target language
- I have a smaller subset (150k sentences) labeled for whether each sentence is toxic or not.
For this question’s sake, assume there isn’t any pretrained model or other existing corpus I can use.
What am I trying to achieve?
The goal is to cluster sentences with similar meanings together.
Plan of action
- Train a RoBERTa language model from scratch on the 4M-sentence corpus
- Fine-tune it for toxic/non-toxic classification on the 150k labeled sentences
- Use a hidden layer of that classifier as the sentence-embedding model
- Cluster the embeddings with HDBSCAN.
That’s about it, I tried to keep it as clear as I can.
Thanks to anyone who read this far!!