Using Roberta for Sentence2Vec

Hey, I’ve been trying to train Sentence2Vec embeddings, and I’m wondering what you think about my approach.
I’d be glad to learn about possible pitfalls in my approach and how to avoid them.

What do I have?

  1. I have a small, unique corpus of about 4 million sentences in my target language.
  2. I have a smaller subset (150k) of sentences labeled for whether or not they are toxic.

For this question’s sake assume there isn’t any pretrained model or another existing corpus I can use.

What am I trying to achieve?

The target is to be able to cluster sentences that have similar meanings together.

Way of Action

  1. Train a RoBERTa language model.
  2. Fine-tune it for toxic/non-toxic classification.
  3. Use a hidden layer of that classifier as an embedding model.
  4. Cluster the embeddings with HDBSCAN.
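Steps 3–4 can be sketched roughly as below. This is a hedged sketch, not your exact setup: the model path is a placeholder for wherever your fine-tuned RoBERTa classifier lives, mean pooling over the last hidden layer is one common choice (the CLS token is another), and the clustering uses scikit-learn's `HDBSCAN` (the standalone `hdbscan` package has an equivalent class).

```python
# Sketch: turn a fine-tuned classifier's hidden layer into sentence embeddings,
# then cluster them. Model path and hyperparameters are placeholders.
import numpy as np


def mean_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Mean-pool token embeddings (batch, seq, dim) into sentence embeddings,
    ignoring padding positions indicated by attention_mask (batch, seq)."""
    mask = attention_mask[..., None].astype(hidden_states.dtype)  # (batch, seq, 1)
    summed = (hidden_states * mask).sum(axis=1)                   # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                # (batch, 1)
    return summed / counts


if __name__ == "__main__":
    import torch
    from transformers import AutoModel, AutoTokenizer
    from sklearn.cluster import HDBSCAN  # scikit-learn >= 1.3

    # Placeholder path: your RoBERTa after LM training + classification fine-tuning.
    tok = AutoTokenizer.from_pretrained("path/to/finetuned-roberta")
    model = AutoModel.from_pretrained("path/to/finetuned-roberta")

    sentences = ["example sentence one", "example sentence two"]
    batch = tok(sentences, padding=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state.numpy()

    emb = mean_pool(hidden, batch["attention_mask"].numpy())
    labels = HDBSCAN(min_cluster_size=5).fit_predict(emb)  # -1 marks noise points
```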

That’s about it; I tried to keep it as clear as I could.
Thanks to anyone who read this far!!

What is your main goal? Is it to have a classifier for toxic vs non-toxic sentences? Or to cluster sentence embeddings by semantic meaning?

For creating sentence embeddings I would recommend sentence-transformers. It’s an extension of the regular Hugging Face transformers library, optimized for creating text embeddings.
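For a sense of the API, a minimal sketch looks like the following. The checkpoint name is just a commonly used pretrained example; in the scenario above (no pretrained models allowed) you would point it at your own trained model instead.

```python
# Minimal sentence-transformers usage sketch (pip install sentence-transformers).
import numpy as np


def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


if __name__ == "__main__":
    from sentence_transformers import SentenceTransformer

    # Placeholder checkpoint; swap in your own model directory.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode(["first example sentence", "second example sentence"])
    print(cosine_sim(emb[0], emb[1]))
```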


Will take a look at that!
Is there a recommended way to fine-tune embeddings created by sentence-transformers?

Yes, there is sample code in the repo 🙂 I’d also recommend checking out the repo homepage for good explanations of the different use cases.
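For the setup above (single sentences with a binary label, rather than sentence pairs), one option from the library's loss collection is a batch-triplet loss, which pulls same-label sentences together. This is a hedged sketch based on the repo's training examples; the model path, toy data, and hyperparameters are placeholders, and the stratified-split helper is just an assumed convenience for holding out a dev set.

```python
# Sketch: fine-tune a sentence-transformers model on labeled single sentences.
import random


def stratified_split(items, labels, dev_frac=0.1, seed=42):
    """Split (item, label) pairs into train/dev lists, keeping the label ratio."""
    rng = random.Random(seed)
    by_label = {}
    for item, label in zip(items, labels):
        by_label.setdefault(label, []).append(item)
    train, dev = [], []
    for label, group in by_label.items():
        rng.shuffle(group)
        cut = max(1, int(len(group) * dev_frac))
        dev += [(g, label) for g in group[:cut]]
        train += [(g, label) for g in group[cut:]]
    return train, dev


if __name__ == "__main__":
    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, InputExample, losses

    model = SentenceTransformer("path/to/your-model")  # placeholder

    # Toy stand-in for the 150k labeled sentences (1 = toxic, 0 = not).
    data = [("an example sentence", 0), ("another example sentence", 1)]
    train, dev = stratified_split([t for t, _ in data], [l for _, l in data])

    examples = [InputExample(texts=[text], label=label) for text, label in train]
    loader = DataLoader(examples, shuffle=True, batch_size=16)
    # Batch-hard triplet loss works on single sentences with class labels.
    loss = losses.BatchHardTripletLoss(model=model)
    model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
```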
