Is there any way of getting similarities between very long text documents? I know how to get the similarity between sentences using sentence transformers, but is there a model that can give me a one-shot output: similar or not? Something like a Siamese network that can tell whether 2 random images are similar. I might be wrong about the analogy, but the problems seem very similar.
If such models don’t exist, is there a method where I can make use of transformers to get similarities between long documents?
Hi @hemangr8, a very simple thing you can try is:
- Split the document into passages or sentences
- Embed each passage / sentence as a vector
- Take the average of the vectors to get a single vector representation of the document
- Compare documents using your favourite similarity metric (e.g. cosine similarity)
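The steps above could be sketched roughly like this. It is a minimal illustration, not a tuned implementation: `toy_embed` is a hypothetical stand-in (a hashed bag of words) so the example runs without downloading a model, and the function names and the fixed word-count chunking are assumptions of mine. In practice you would pass in a real embedding function, such as the `encode` method of a sentence-transformers model.

```python
import hashlib
import math
import re
from typing import Callable, List, Sequence

Vector = List[float]


def split_into_passages(text: str, max_words: int = 100) -> List[str]:
    """Split text into chunks of at most `max_words` whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def toy_embed(passage: str, dim: int = 64) -> Vector:
    """Hypothetical stand-in embedder (hashed bag of words), NOT a transformer.

    Swap in a real model here, e.g. a function that wraps a
    sentence-transformers model's `encode` and returns a list of floats.
    """
    vec = [0.0] * dim
    for token in re.findall(r"\w+", passage.lower()):
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    return vec


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Element-wise average of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def document_vector(
    text: str,
    embed: Callable[[str], Vector] = toy_embed,
    max_words: int = 100,
) -> Vector:
    """Split, embed each passage, and average into one document vector."""
    passages = split_into_passages(text, max_words)
    if not passages:
        raise ValueError("document is empty")
    return mean_vector([embed(p) for p in passages])


def document_similarity(
    doc_a: str,
    doc_b: str,
    embed: Callable[[str], Vector] = toy_embed,
    max_words: int = 100,
) -> float:
    """Cosine similarity between the averaged passage vectors of two documents."""
    return cosine_similarity(
        document_vector(doc_a, embed, max_words),
        document_vector(doc_b, embed, max_words),
    )
```

Calling `document_similarity(text_a, text_b)` returns a single score, which you would then threshold yourself if you need a yes/no "similar or not" answer. Note that plain averaging weights every passage equally, so a few off-topic passages can pull the document vector around.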
Depending on the length of your documents, you could also try the Longformer Encoder-Decoder (LED), which has a context size of 16K tokens: allenai/led-large-16384
If your documents fit within the 16K limit, you could embed each one in a single pass. There are also some related ideas in this thread: Summarization on long documents
Hi @hemangr8,
I am working on a similar problem, so if you tried any of the suggested solutions, I am curious to know which worked best: averaging the vectors or using the Longformer Encoder-Decoder?
I’m also working on a similar problem and would be interested in hearing about your progress, @hemangr8 and @maximilienroberti.