Document Similarity of long documents e.g. legal contracts

Hi @hemangr8, a very simple thing you can try is:

  1. Split the document into passages or sentences
  2. Embed each passage / sentence as a vector
  3. Take the average of the vectors to get a single vector representation of the document
  4. Compare documents using your favourite similarity metric (e.g. cosine similarity)
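
To make the four steps concrete, here is a minimal sketch in plain Python. Note that `toy_embed` is only a placeholder I made up (a hashed bag-of-words vector) so the example runs without downloading a model; in practice you would pass your own sentence encoder as the `embed` argument. The regex sentence splitter, the `DIM` constant and all function names are likewise just assumptions for illustration.

```python
import hashlib
import math
import re

DIM = 64  # size of the toy vectors; a real encoder defines its own dimension


def split_passages(text):
    """Step 1: naive split on '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def toy_embed(passage):
    """Step 2 (placeholder): hashed bag-of-words vector.

    Stand-in for a real sentence encoder; swap in your model here.
    """
    vec = [0.0] * DIM
    for word in re.findall(r"\w+", passage.lower()):
        idx = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % DIM
        vec[idx] += 1.0
    return vec


def embed_document(text, embed=toy_embed):
    """Step 3: average the passage vectors into one document vector."""
    vectors = [embed(p) for p in split_passages(text)]
    if not vectors:
        raise ValueError("document has no passages")
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine_similarity(a, b):
    """Step 4: cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    doc_a = "The tenant shall pay rent monthly. The landlord shall maintain the premises."
    doc_b = "Rent is paid monthly by the tenant. The landlord maintains the premises."
    print(cosine_similarity(embed_document(doc_a), embed_document(doc_b)))
```

One thing to keep in mind: a plain average weights every passage equally, so a long contract with lots of boilerplate clauses will be dominated by that boilerplate. Weighting passages (e.g. by length) is an easy variation to try.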

Depending on the length of your documents, you could also try the Longformer Encoder-Decoder (LED), which has a context size of 16K tokens: allenai/led-large-16384 · Hugging Face

If your documents fit within the 16K limit, you could embed each of them in one go. There are also some related ideas in this thread: Summarization on long documents