I have been inspired to create a semantic text search engine for a niche domain, and I am wondering how I should proceed. The basic approach will be to use a transformer model to embed potential results into vectors, use the same model to embed search queries, and then use cosine similarity to compare the query vector with the result vectors.
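Concretely, the retrieval flow I have in mind looks roughly like this (sentence-transformers is just one way to express it, and the model name is only a placeholder for whatever domain-adapted model I end up with):

```python
from sentence_transformers import SentenceTransformer, util

# Placeholder model; the whole question is how to get a good one for my domain.
model = SentenceTransformer("all-MiniLM-L6-v2")

corpus = [
    "First candidate result ...",
    "Second candidate result ...",
]

# Embed all potential results once, ahead of time.
corpus_embeddings = model.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)

# Embed the query with the same model and rank results by cosine similarity.
query_embedding = model.encode("my search query", convert_to_tensor=True, normalize_embeddings=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=5)
print(hits[0])  # [{'corpus_id': ..., 'score': ...}, ...]
```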
The main issue I see right now is that it is hard to get good embeddings for a niche domain. From what I can gather, training a model on an NLI task (textual entailment) is the best way to get good sentence embeddings, but NLI is a supervised task that requires labeled data. The next closest task would be NSP, which doesn't need a labeled dataset, but the RoBERTa paper showed that NSP isn't a very useful training objective. What I've noticed other people do for Covid semantic search is to take SciBERT or BioBERT, continue pretraining it with MLM on PubMed or CORD-19 articles, and then finish with training on an NLI task. I believe the NLI data was unrelated to Covid or biology, since I don't know of any in-domain NLI datasets like that.
I have seen joeddav's blogpost and the recent ZSL pipeline work, and while ZSL is cool and has its purposes, it would be ineffective here: the NLI-based pipeline needs a full forward pass for every query-result pair, so comparing a search query against thousands or even just hundreds of results in real time isn't practical.
I have one main question: How should I train a model to generate good sentence vectors in a niche domain?
My current plan is to take a pretrained model, fine-tune it with MLM on in-domain texts, and then do NLI training on SNLI. I am worried that it will be hard to gauge when to stop the NLI training: it seems like the longer it trains, the better it gets at producing sentence-level vectors, but the more it forgets the in-domain knowledge picked up during MLM.
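For the NLI step, I was picturing the standard sentence-transformers SNLI recipe, roughly like below (the checkpoint path, hyperparameters, and example pair are placeholders; I save intermediate checkpoints so I can compare them on in-domain data afterwards, since I don't know up front when to stop):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses, InputExample

# Wrap the MLM-adapted checkpoint (path is a placeholder) as a bi-encoder with mean pooling.
word_embedding_model = models.Transformer("path/to/mlm-adapted-model", max_seq_length=128)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# SNLI premise/hypothesis pairs with entailment/neutral/contradiction labels
# (loading the real dataset is left out; this is one made-up example).
train_examples = [
    InputExample(texts=["A man inspects a uniform.", "The man is sleeping."], label=2),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.SoftmaxLoss(
    model=model,
    sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
    num_labels=3,
)

# Save intermediate checkpoints so they can be compared on in-domain data later,
# since it's unclear how long to train before domain knowledge degrades.
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
    output_path="niche-sbert-nli",
    checkpoint_path="niche-sbert-nli/checkpoints",
    checkpoint_save_steps=1000,
)
```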
Moreover, I'm worried that the MLM fine-tuning won't work well because my corpus is tens of thousands of 2-4 sentence chunks rather than long documents.
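For completeness, the MLM step I'm planning is just the standard Trainer setup, with each 2-4 sentence chunk treated as one training example (file name, base model, and hyperparameters are placeholders):

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "bert-base-uncased"  # placeholder; could be SciBERT/BioBERT etc.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# One 2-4 sentence chunk per line; no document-level grouping or NSP.
dataset = load_dataset("text", data_files={"train": "in_domain_chunks.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking of 15% of tokens, the usual MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="mlm-adapted",
        num_train_epochs=3,
        per_device_train_batch_size=32,
    ),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```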