STS on a niche domain

I’m trying to solve an STS problem. The task is to find the n most relevant documents for each new document. My current approach is to generate a document embedding for each document and use Euclidean distance to search an approximate nearest neighbour index (Annoy) for the n nearest documents. The issue is that the documents vary significantly in length and language usage, which in some cases buries the actual core of the document. A very simplified example:

Report A: “My car needs servicing again, please replace the air-filters.”
Report B: “We were planning to go on vacation next week, but the ventilation isn’t working properly. Perhaps it has something to do with the filters. Please change them at the next appointment.”

Both reports describe the need for an air-filter replacement, but with different focus and length.
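For reference, this is roughly what my current pipeline looks like (a minimal sketch; the spaCy model and the Annoy parameters here are just placeholders, not my actual configuration):

```python
import numpy as np
import spacy
from annoy import AnnoyIndex

nlp = spacy.load("en_core_web_md")  # any model with word vectors

def embed(text: str) -> np.ndarray:
    """Average the token vectors to get a single document vector."""
    vectors = [t.vector for t in nlp(text) if t.has_vector]
    if not vectors:
        return np.zeros(nlp.vocab.vectors_length)
    return np.mean(vectors, axis=0)

documents = [
    "My car needs servicing again, please replace the air-filters.",
    "We were planning to go on vacation next week, but the ventilation isn't working properly.",
]

dim = nlp.vocab.vectors_length
index = AnnoyIndex(dim, "euclidean")
for i, doc in enumerate(documents):
    index.add_item(i, embed(doc))
index.build(10)  # number of trees

# n nearest documents for a new report
query = embed("Please change the filters at the next appointment.")
neighbours = index.get_nns_by_vector(query, 5)
```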

The documents in my real-world problem are roughly 10-200 tokens long and cover a broad and deeply specialized technical domain with synonyms, abbreviations, etc. Unfortunately, the averaged document vector is not enough to capture these nuances. I’ve been considering training a summarization model that extracts the core of the problem from each document and using that vector representation instead, but I’d like to know what the current SOTA approach to this problem is.

For reference, I have about 250 MB of unlabeled domain-specific text. I have also started building a labeled training set of similar documents (only a few hundred examples) that I’ve used to fine-tune different pretrained models. So far the results have been sobering.
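This is roughly how I set up the fine-tuning on that labeled set (a minimal sketch using sentence-transformers; the base model, loss, and hyperparameters are placeholders rather than my exact setup):

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("all-MiniLM-L6-v2")

# Pairs of documents I labeled as similar
train_examples = [
    InputExample(texts=[
        "My car needs servicing again, please replace the air-filters.",
        "Please change the filters at the next appointment.",
    ]),
    # ... a few hundred more pairs
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=50,
)
```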