Training for sentence vectors in niche domain

Hi everyone,

I have been inspired to create a semantic text search engine for a niche domain, and I am wondering how I should proceed. The basic approach will be to use a transformer model to embed potential results into vectors, use the same model to embed search queries, and then use cosine similarity to compare the query vector with the result vectors.
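A minimal sketch of the ranking step described above, assuming the embeddings have already been produced by some model. The toy 3-dimensional vectors and the `cosine_rank` helper are illustrative stand-ins, not output from any particular transformer.

```python
import numpy as np

def cosine_rank(query_vec, doc_vecs):
    """Return document indices sorted from most to least similar, plus all scores."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                 # cosine similarity of each document with the query
    order = np.argsort(-scores)    # highest similarity first
    return order, scores

# Toy "embeddings" standing in for real model output (assumed values).
doc_vecs = np.array([
    [1.0, 0.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
])
query_vec = np.array([0.9, 0.1, 0.0])

order, scores = cosine_rank(query_vec, doc_vecs)
```

In a real setup, `doc_vecs` would be computed once for all candidate results and only `query_vec` would be computed at search time.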

The main issue I see right now is that it is hard to have good embeddings for a niche domain. From what I can gather, training a model on an NLI task (textual entailment) is best for having good sentence embeddings, but NLI is a supervised task that requires labeled data. The next closest task would be NSP, which can be done without a labeled dataset, but RoBERTa showed that NSP isn’t a good way of training a model. What I’ve noticed other people do for Covid semantic searches is to take SciBERT or BioBERT, train it more on PubMed articles or Cord-19 articles doing MLM, and then finally end with training on an NLI task. I think the NLI task was unrelated to Covid or biology because I don’t know of any in-domain NLI tasks like that.

I have seen joeddav’s blogpost and the recent ZSL pipeline work, and while ZSL is cool and has its purposes, it would be ineffective for comparing a search query against thousands or even just hundreds of results in real time.

I have one main question: How should I train a model to generate good sentence vectors in a niche domain?

My current plan is to take a pretrained model, fine-tune it using MLM on in-domain texts, and then do NLI training using SNLI. I am worried that it will be hard to gauge when to stop the NLI training because it seems like the longer it trains, the better it gets at producing sentence-level vectors, but the more it forgets about in-domain information.

Moreover, I’m worried that the fine-tuning using MLM won’t go great because I have tens of thousands of 2-4 sentence chunks rather than long documents.
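For context on the short-chunk concern: the MLM objective is defined per token, so each 2-4 sentence chunk still yields training positions. Whether that is enough signal is something to test empirically. Below is a rough sketch of BERT-style masking (select about 15% of tokens; of those, 80% become the mask token, 10% a random token, 10% stay unchanged). The function name, the token ids, and the omission of special-token handling are simplifying assumptions.

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, mask_prob=0.15, seed=None):
    """Return (inputs, labels) for one MLM example.

    labels hold the original id at selected positions and -100 elsewhere;
    -100 is the index PyTorch's cross-entropy loss ignores by default.
    Special tokens are not treated differently in this sketch.
    """
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue                      # position not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id           # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return inputs, labels

# Illustrative ids only; a real tokenizer would supply these.
inputs, labels = mask_tokens([2023, 2003, 1037, 3231], mask_id=103, vocab_size=30522, seed=0)
```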


Hi @nbroad, interesting question.
What kind of niche domain are you considering?
Since we now have several hundred (if not thousands of) NLP datasets, would it be possible to find similar datasets for a preliminary round of MLM before the final MLM on your own data?

I don’t have direct experience with sentence similarity training, but I once trained a classifier in the multilingual toxic-comment domain (maybe a bit niche), where fine-tuning with MLM did improve performance compared to no MLM.

Hi Jung, thanks for the reply.

If I had to pick a domain name, I would say medical notes. They contain a lot of technical terms but are written quickly, with abbreviations and acronyms galore. I am planning on using BioBERT, SciBERT, or ClinicalBERT as the pretrained base model.

I think the issue I have is that there isn’t an NLI dataset I can use for my niche domain. If I understand correctly, any data source can be used for MLM and NSP, but I don’t have the means of creating a labeled textual entailment (NLI) dataset, and I’ve heard that this type of dataset is the best for creating good sentence vectors.

See this paper and this article for reference about sentence embeddings.

I encountered a similar problem. In a zero-shot setting, sentence embeddings look good when the sentences have similar lengths, but the embeddings of very short queries are much worse. The query embedding ends up much closer to short, irrelevant texts than to long, relevant ones. I am still looking for a good, simple training task to solve this problem.

@nbroad Thanks for the clarification!
May I ask: given a sentence embedding model M1 and another sentence embedding model M2, do you have a solid metric for determining whether M1 is better than M2 in your problem domain?

I don’t have a great metric, but what I’ve been doing is a mock version of the real application. I use the embedder to convert ~100 examples to embeddings. I then turn a few queries into vectors and see which examples are closest using cosine similarity. I have a list of 5 queries and a general idea of what results I would want to see for each query.
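One way to turn this kind of spot check into a number for comparing two embedders is to write down the expected results per query and compute standard retrieval metrics such as recall@k and mean reciprocal rank. The sketch below assumes each model has already produced a ranked list of example ids per query. The query names, ids, and rankings are made up for illustration.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def evaluate(rankings, judgments, k=10):
    """Average both metrics over all judged queries."""
    n = len(judgments)
    recall = sum(recall_at_k(rankings[q], judgments[q], k) for q in judgments) / n
    mrr = sum(reciprocal_rank(rankings[q], judgments[q]) for q in judgments) / n
    return {"recall@k": recall, "mrr": mrr}

# Hypothetical judgments (which example ids should come back for each query)
# and hypothetical ranked outputs from two models, M1 and M2.
judgments = {"q1": [3, 7], "q2": [1]}
rankings_m1 = {"q1": [3, 9, 7, 2], "q2": [4, 1, 5, 6]}
rankings_m2 = {"q1": [9, 2, 3, 8], "q2": [5, 6, 4, 1]}

scores_m1 = evaluate(rankings_m1, judgments, k=3)
scores_m2 = evaluate(rankings_m2, judgments, k=3)
```

With only a handful of queries the numbers will be noisy, so they are better read as a tie-breaker than as a definitive ranking of models.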

@joeddav, do you have any ideas? I’d be interested to hear your perspective on this matter.

To the thread - tasks involving medical notes are rather niche because they involve a lot of domain expertise (meaning labeling is time-consuming and super expensive) and there’s really not that much data to train on compared to other domains.

I’ve been wondering about the same situation myself (medical domain with clinical notes or needing to do literature search). I found that sentence vectors (using sentence transformers) have some limitations. They don’t work all that well with the pre-trained models, and I found that the similarity scores more or less act as fancy regex functions and don’t do much to capture semantics.

You saw this as well in the Cord-19 Kaggle challenge where everyone implemented roughly the same idea but none of the results were that convincing.

I think there are a couple of things you could potentially explore:

  1. A pretrained paraphrase task may be better than similarity task
  2. Going back to simple vectors (like fastText) and doing your search query on those embedded terms (but this really will only benefit literature search rather than notes due to corpus size)
  3. Knowledge graph creation with embeddings.
  4. Simple training exercise

Hi @Weilin,

Thank you for the response. I am wondering if you would be able to expand on your suggestions or point me to some resources that would help.

I agree; I have also noticed that sentence vectors act like a fancy regex function, but I feel like they have potential for semantic similarity! Still, maybe good sentence vectors for semantic similarity aren’t a thing just yet.

Regarding your suggestions.

A pretrained paraphrase task may be better than similarity task

Do you mean it would be better to train the model on a paraphrasing task? Or do you mean that the end application should use paraphrasing and not similarity?

Going back to simple vectors (like fastText) and doing your search query on those embedded terms (but this really will only benefit literature search rather than notes due to corpus size)

As far as I know, fastText does not work well on word phrases (sentences) so this approach would have to embed keywords from the search query as well as embedding the notes in a similar fashion. Am I understanding you correctly?

Knowledge graph creation with embeddings.

This I don’t know much about, but if you have good resources on it, I’d be interested in learning more about it.

Simple training exercise

Could you elaborate?

Not sure if transformer models are required here. I think you should be fine with something like sent2vec and doing similarity searches with faiss.

This recent approach (LaBSE) by Google might be useful too, as it seems to do the steps above in one go.

Hi @BramVanroy,

Thanks for the reply. I’m a little embarrassed to admit that I didn’t know about sent2vec. I’ll give that a shot and report back on how it did.

LaBSE also looks interesting, but when I read through the paper I noticed this:

We observe that LaBSE performs worse on pairwise English semantic similarity than other sentence embedding models. This result contrasts with its excellent performance on cross-lingual bi-text retrieval. The cross-lingual m-USE model notably achieves the best overall performance, even outperforming SentenceBERT when SentenceBERT is not fine-tuned for the STS task.
We suspect training LaBSE on translation pairs biases the model to excel at detecting meaning equivalence, but not at distinguishing between fine grained degrees of meaning overlap.

Seems like it does far better for similarity across languages, but not within the same language. I appreciate you sharing it, though! I had heard of USE but it looks like I have plenty more to research and try. I think this might just be a scenario where I will have to try multiple different models and see what happens. Thanks!

@nbroad can you elaborate further on your specific use case? I was speaking more in a general sense.


I am basically making a semantic search for medical notes. Very technical and domain-specific terminology with a lot of shorthand abbreviations and spelling mistakes. I would like to go beyond tfidf for searching through the notes and hopefully be able to find notes based on the semantic meaning of the query and the note.

Hey! Did you find any solutions for this problem? Basically, I worked with tf-idf and then used cosine similarity, but I’d like to see whether BERT can be used for this.
The task is really simple: given a sentence and a corpus of probably a million rows, return the top 10 sentences most similar to the one the user inputs (kind of like what Quora or Stack Overflow does).
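As a rough sketch of that task, here is exact (brute-force) top-k retrieval over a matrix of precomputed sentence embeddings, using plain NumPy. The random matrix stands in for real embeddings, and `build_index` and `top_k` are hypothetical helper names. Dedicated libraries such as FAISS, mentioned earlier in the thread, offer the same kind of search plus approximate indexes for larger corpora; their API is not shown here.

```python
import numpy as np

def build_index(embeddings):
    """L2-normalize once so that a dot product equals cosine similarity."""
    emb = np.asarray(embeddings, dtype=np.float32)
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    return emb / np.clip(norms, 1e-12, None)

def top_k(index, query_vec, k=10):
    """Exact top-k by cosine similarity; returns (ids, scores), best first."""
    q = np.asarray(query_vec, dtype=np.float32)
    q = q / max(np.linalg.norm(q), 1e-12)
    scores = index @ q
    k = min(k, len(scores))
    cand = np.argpartition(-scores, k - 1)[:k]   # unordered top k without a full sort
    cand = cand[np.argsort(-scores[cand])]       # sort only those k candidates
    return cand, scores[cand]

rng = np.random.default_rng(0)
corpus = rng.normal(size=(20_000, 64))   # random stand-in for sentence embeddings
index = build_index(corpus)

# Query with a vector taken from the corpus itself, so row 42 should rank first.
ids, scores = top_k(index, corpus[42], k=10)
```

The corpus is embedded and normalized once up front; each query then costs one matrix-vector product plus a partial sort.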

Hi @ravijoe, as suggested earlier in the thread, you could try using FAISS for the similarity search on the BERT embeddings. There’s many tutorials online, eg

Oh, thanks for this! And this doesn’t require any type of fine-tuning, right?

One more thing: if we use tf-idf + cosine similarity for this kind of task, do you think it will perform well?

Tf-idf works well for hard matches. BERT will help with semantic similarity even if the words don’t match. If your query is “why is it so expensive?”, BERT would find results that mention high price and significant cost, but tf-idf would probably only match results with the word expensive (or whatever the lemma/stem for expensive is).
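A toy illustration of the hard-match behaviour, using a small hand-rolled tf-idf with a smoothed idf term. The two documents, the tokenizer, and the helper names are assumptions made for the example. The point is only that a document with no overlapping words scores exactly zero, however close it is in meaning.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase, split on whitespace, strip basic punctuation."""
    return [w.strip(".,?!").lower() for w in text.split() if w.strip(".,?!")]

def tfidf_vectors(docs):
    """Return (vectors, idf); each vector is a {term: weight} dict."""
    tokenized = [tokenize(d) for d in docs]
    df = Counter(term for toks in tokenized for term in set(toks))
    n = len(docs)
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}  # smoothed idf
    vecs = [{t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in tokenized]
    return vecs, idf

def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = [
    "The laptop was expensive.",
    "High price and significant cost.",
]
vecs, idf = tfidf_vectors(docs)

# Terms the corpus has never seen get weight 0 in the query vector.
query_counts = Counter(tokenize("why is it so expensive?"))
query = {t: tf * idf.get(t, 0.0) for t, tf in query_counts.items()}

scores = [cosine(query, v) for v in vecs]
```

Here the first document gets a positive score through the shared word "expensive", while the second, which says much the same thing in different words, gets zero. That gap is what dense embeddings are meant to close.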

OK, so in short, I should encode using BERT and then use FAISS for indexing, right? And no fine-tuning is required for this?