Hey, I’ve been trying to train Sentence2Vec embeddings and I’ve been wondering what you think about my approach.
I’d be glad to learn about possible pitfalls in my approach and how to avoid them.
What do I have?
- I have a small corpus of about 4 million unique sentences in my target language
- I have a smaller subset (150k sentences) labeled for whether each sentence is toxic or not.
For this question’s sake, assume there isn’t any pretrained model or other existing corpus I can use.
What am I trying to achieve?
The goal is to cluster sentences with similar meanings together.
Plan of action
- Train a RoBERTa language model from scratch on the 4M-sentence corpus
- Fine-tune it for toxic/non-toxic classification on the 150k labeled sentences
- Use a hidden layer of that classifier as the sentence-embedding model
- Cluster the embeddings with HDBSCAN.
That’s about it, I tried to keep it as clear as I can.
Thanks to anyone who read this far!!