I have a use case in which the user enters the details of the problem their computer is facing, and the system returns the most relevant article from a list of articles that might solve the issue.
I tried cosine similarity and BM25 but I'm not getting decent results. Can someone please suggest which pre-trained model I can use for this kind of data and use case?
Sorry, I am new to transformers.
hey @gladmortal your use case sounds like a good match for dense retrieval, where you use a transformer like BERT to embed your documents as dense vectors and then measure their similarity to a query vector.
you can find an example of how to do this with FAISS and the datasets library here: Adding a FAISS or Elastic Search index to a Dataset — datasets 1.6.2 documentation
there is also a nice example from sentence-transformers here if your articles are not too long: Semantic Search — Sentence-Transformers documentation
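the semantic-search recipe boils down to: embed every article once, embed the incoming query, and rank the articles by vector similarity. here is a minimal sketch of the ranking step in plain NumPy (the embeddings are toy 4-dimensional vectors for illustration; in practice they would come from a sentence-transformers model):

```python
import numpy as np

def rank_by_cosine(query_emb: np.ndarray, article_embs: np.ndarray, top_k: int = 3):
    """Return indices of the top_k articles most similar to the query."""
    # Normalize both sides so the dot product equals cosine similarity
    q = query_emb / np.linalg.norm(query_emb)
    a = article_embs / np.linalg.norm(article_embs, axis=1, keepdims=True)
    scores = a @ q
    return np.argsort(-scores)[:top_k], scores

# Toy "embeddings" for three articles and one query
articles = np.array([[1.0, 0.0, 0.0, 0.0],
                     [0.9, 0.1, 0.0, 0.0],
                     [0.0, 0.0, 1.0, 0.0]])
query = np.array([1.0, 0.05, 0.0, 0.0])

top, scores = rank_by_cosine(query, articles, top_k=2)
print(top)  # articles 0 and 1 are closest to this query
```

you would run the embedding of the 4000 articles once, cache the matrix, and only embed the query at request time.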
alternatively, if you have a corpus of (question, article) tuples, you could try doing a similarity comparison of new questions against the existing ones and using the matches to return the most relevant article. there's a tutorial on this from a nice library called haystack, which is built on transformers, here: https://haystack.deepset.ai/docs/latest/tutorial4md
Thanks @lewtun, this is really helpful. Can you also help me understand how I can determine which model (BERT, RoBERTa, DistilBERT - Index of /reimers/sentence-transformers/v0.2/) is going to work well in my use case, especially the embeddings, since people are seeking information on hardware/software-related concerns? Do I need to train my own embeddings in that case?
A good way is to look at the models listed here:
There you find different categories of models with some performance results.
For your use-case, I recommend models trained on MS MARCO:
These models were trained for the use case where the user inputs a search query and the system searches for relevant passages that provide the answer.
A model that works well is ‘msmarco-distilbert-base-tas-b’ combined with dot-product scoring.
Thanks @nreimers. SBERT is amazing, and I wish you good luck on your journey with Hugging Face.
I have a question:
As suggested by you, I am using ‘msmarco-distilbert-base-tas-b’ as the bi-encoder, following the example mentioned here: sentence-transformers/semantic_search_wikipedia_qa.py at master · UKPLab/sentence-transformers · GitHub
I have a total of 4000 articles, and the average size of each article is 3500 characters. Do I need to use a cross-encoder as well? I am fine with showing the complete article to the user instead of just a part of it.
A cross-encoder can give an additional performance boost, but comes with additional compute overhead. So you can choose what is more important:
- Lower latency and less compute: just use a bi-encoder
- Higher latency, but potentially better results: combine with a cross-encoder
For 4k articles I think just using a bi-encoder will be fine.
If your articles are substantially longer than 512 word pieces, it might make sense to break them down into paragraphs and encode these individually. Bi-encoders and cross-encoders both have a length limit of 512 word pieces (which is about 300-400 words).
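One way to handle long articles, sketched in plain Python: split each article into paragraph-sized chunks, keep a mapping from chunk back to article, and after retrieval return the full article that the best-matching chunk came from (the splitting heuristic and toy articles here are an illustration, not from the thread):

```python
def split_into_chunks(text: str, max_words: int = 300):
    """Greedily pack whole paragraphs into chunks of at most max_words words."""
    chunks, current, count = [], [], 0
    for para in text.split("\n\n"):
        words = len(para.split())
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks

articles = {0: "First paragraph about drivers.\n\nSecond paragraph about reboots.",
            1: "How to clear your cache."}

# chunk_to_article[i] records which article chunk i came from
passages, chunk_to_article = [], []
for article_id, text in articles.items():
    for chunk in split_into_chunks(text, max_words=5):
        passages.append(chunk)
        chunk_to_article.append(article_id)

# After retrieval, map the best chunk back to its full article
best_chunk_idx = 0  # pretend the retriever picked passages[0]
print(articles[chunk_to_article[best_chunk_idx]])
```

You would embed `passages` instead of the full articles, and since you are fine with showing the complete article, the chunk-to-article mapping gives you that for free.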
If bi-encoders and cross-encoders have a limit of 300-400 words, what happens if we give them an article longer than the limit? Will they just take the initial 400 words and ignore everything after that? Also, even if we split an article into 5 to 6 parts and encode them individually, how are we going to maintain the context? Won't it split the context as well, considering one article is the solution for one issue?
Yes, it will just take the initial 512 word pieces. Everything after that is ignored.
It depends on your specific task whether this is a problem. Often, you can judge from the first 512 word pieces whether a text is relevant to the query or not.
When you split it up into multiple paragraphs, you usually don't keep the context. Often, keeping the context is not really needed, as you can judge quite well from a paragraph whether it is relevant to the query or not.
If you have a title (e.g. like you have for Wikipedia), it can make sense to encode it as title [SEP] paragraph. This keeps some of the context.
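Concretely, building those passages could look like this (a sketch; the title and paragraphs are made up):

```python
articles = [
    {"title": "Screen flickering after update",
     "paragraphs": ["Roll back the graphics driver.", "Check the refresh rate."]},
]

# Prefix each paragraph with its article title to retain some context
passages = [f"{a['title']} [SEP] {p}" for a in articles for p in a["paragraphs"]]
print(passages[0])  # "Screen flickering after update [SEP] Roll back the graphics driver."
```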