Document similarity for long documents, e.g. legal contracts

Is there any way of getting similarities between very long text documents? I know how to get the similarity between sentences using sentence transformers, but is there a model that can give me a one-shot output: similar or not? Something like a Siamese network that can tell whether two random images are similar. I might be wrong about the analogy, but it seems very similar.
If such models don’t exist, is there a method where I can make use of transformers to get similarities between long documents?

Hi @hemangr8, a very simple thing you can try is:

  1. Split the document into passages or sentences
  2. Embed each passage / sentence as a vector
  3. Take the average of the vectors to get a single vector representation of the document
  4. Compare documents using your favourite similarity metric (e.g. cosine similarity)
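
The four steps above can be sketched roughly as follows. This is only a minimal illustration using the standard library: `toy_embed` is a hypothetical hashed bag-of-words stand-in for a real sentence encoder, and the function names and the 50-word passage size are assumptions made for the example, not anything prescribed by a particular library.

```python
import math
import re
import zlib
from typing import Callable, List, Sequence

Vector = List[float]


def split_into_passages(text: str, max_words: int = 50) -> List[str]:
    """Step 1: split a document into passages of at most `max_words` words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def toy_embed(passage: str, dim: int = 64) -> Vector:
    """Step 2 (hypothetical stand-in): hashed bag-of-words vector.

    Replace this with a real sentence encoder for meaningful results.
    """
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", passage.lower()):
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    return vec


def mean_pool(vectors: Sequence[Sequence[float]]) -> Vector:
    """Step 3: average the passage vectors into one document vector."""
    if not vectors:
        raise ValueError("cannot pool an empty list of vectors")
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Step 4: cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def document_similarity(
    doc_a: str,
    doc_b: str,
    embed: Callable[[str], Sequence[float]] = toy_embed,
    max_words: int = 50,
) -> float:
    """Run all four steps and return the similarity of the two documents."""
    vec_a = mean_pool([embed(p) for p in split_into_passages(doc_a, max_words)])
    vec_b = mean_pool([embed(p) for p in split_into_passages(doc_b, max_words)])
    return cosine(vec_a, vec_b)


if __name__ == "__main__":
    contract_a = "The tenant shall pay rent on the first day of each month. " * 30
    contract_b = "The lessee must pay the monthly rent on the first of the month. " * 30
    print(document_similarity(contract_a, contract_b))
```

To use a real model, you would pass your own `embed` function instead of `toy_embed`. For example, assuming the sentence-transformers package is installed, something like `lambda p: model.encode(p).tolist()` with `model = SentenceTransformer("all-MiniLM-L6-v2")` should work. Note that plain averaging weights every passage equally, so boilerplate-heavy documents may look more similar than they really are.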

Depending on the length of your documents, you could also try the Longformer Encoder-Decoder (LED), which has a context size of 16K tokens: allenai/led-large-16384 on Hugging Face

If your documents fit within the 16K limit, you could embed them in one go. There are also some related ideas in this thread: Summarization on long documents


Hi @hemangr8,

I am working on a similar problem, so if you tried some of the suggested solutions, I am curious to know which worked best: averaging over the vectors or using a Longformer Encoder-Decoder?

I’m also working on a similar problem and would be interested in hearing your progress @hemangr8 and @maximilienroberti.

Hey, I’m working on a similar problem as well. Can you share your methodologies? @jaxonkeeler