I’m trying to solve an STS (semantic textual similarity) problem. The task is to find the n most relevant documents for each new document. My current approach is to generate an embedding for each document and use Euclidean distance to search an approximate nearest neighbour index (Annoy) for the n nearest documents (a rough sketch of this pipeline is included below the example). However, the documents vary significantly in length and language usage, which in some cases obscures the actual core of a document. A very simplified example:
Report A: “My car needs servicing again, please replace the air-filters.”
Report B: “We were planning to go on vacation next week, but the ventilation isn’t working properly. Perhaps it has something to do with the filters. Please change them at the next appointment.”
Both reports describe the need for a replaced air filter, but with a different focus and length.
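For context, this is roughly what the current pipeline looks like (the model name, tree count, and n are placeholders, not the actual values I use):

```python
from annoy import AnnoyIndex
from sentence_transformers import SentenceTransformer

# Placeholder model; in practice this is whichever pretrained encoder
# I'm currently experimenting with.
model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "My car needs servicing again, please replace the air-filters.",
    "We were planning to go on vacation next week, but the ventilation "
    "isn't working properly. Perhaps it has something to do with the "
    "filters. Please change them at the next appointment.",
    # ... the rest of the corpus
]

# One embedding per document.
embeddings = model.encode(documents)

# Build an Annoy index over the embeddings using Euclidean distance.
index = AnnoyIndex(embeddings.shape[1], "euclidean")
for i, vector in enumerate(embeddings):
    index.add_item(i, vector)
index.build(50)  # number of trees; placeholder value

def top_n(new_document: str, n: int = 10):
    """Return the ids and distances of the n nearest existing documents."""
    query_vector = model.encode(new_document)
    return index.get_nns_by_vector(query_vector, n, include_distances=True)
```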
The documents in my real-world problem are approximately 10–200 tokens long and cover a broad, deeply specialized technical domain full of synonyms, abbreviations, etc. Unfortunately, the average document vector is not sufficient to capture these nuances. I’ve been considering training a summarization model that extracts the core problem from each document and then using the vector representation of that summary. But I would like to know what the current SOTA approach for this kind of problem is.
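To make the summarization idea concrete, the rough plan would be something like the following (both model names are placeholders to illustrate the idea; a usable summarizer would have to be trained on my own domain data):

```python
from sentence_transformers import SentenceTransformer
from transformers import pipeline

# Placeholder models, for illustration only.
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def core_embedding(document: str):
    # Compress the report to its core statement, then embed the summary
    # instead of the full document.
    summary = summarizer(document, max_length=40, min_length=5, do_sample=False)
    return encoder.encode(summary[0]["summary_text"])
```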
For reference, I have about 250 MB of unlabeled domain-specific text. I have also started generating a labeled training set of similar documents (only a few hundred examples so far) that I’ve used to fine-tune different pretrained models. So far the results are sobering.
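For what it’s worth, the fine-tuning on the labeled pairs looks roughly like this (a sketch using sentence-transformers; the base model and hyperparameters are placeholders and varied between experiments):

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder base model

# A few hundred labeled pairs of similar documents.
train_examples = [
    InputExample(texts=[
        "My car needs servicing again, please replace the air-filters.",
        "The ventilation isn't working properly, please change the filters.",
    ]),
    # ...
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=10,
)
model.save("domain-finetuned-encoder")  # hypothetical output path
```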