Hi @hemangr8, a very simple thing you can try is:
- Split the document into passages or sentences
- Embed each passage / sentence as a vector
- Take the average of the vectors to get a single vector representation of the document
- Compare documents using your favourite similarity metric (e.g. cosine similarity) — there's a minimal sketch of the whole pipeline just below this list
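To make that concrete, here's a rough sketch using sentence-transformers. The model name (`all-MiniLM-L6-v2`) and the naive full-stop sentence splitting are just placeholder choices — swap in whatever embedding model and splitter suit your data:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Placeholder model choice; any sentence-embedding model works here.
model = SentenceTransformer("all-MiniLM-L6-v2")

def doc_vector(text: str) -> np.ndarray:
    # Naive sentence splitting -- use nltk or spacy for anything serious.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    embeddings = model.encode(sentences)  # shape: (n_sentences, dim)
    return embeddings.mean(axis=0)        # average into one document vector

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

doc_a = doc_vector("First document. It talks about cats.")
doc_b = doc_vector("Second document. It discusses dogs.")
print(cosine(doc_a, doc_b))
```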
Depending on the length of your documents, you could also try the Longformer Encoder-Decoder (LED), which has a context size of 16K tokens: allenai/led-large-16384 · Hugging Face
If your documents fit within the 16K limit, you could embed each one in a single pass (see the sketch below). There are some related ideas in this thread as well: Summarization on long documents
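For the LED option, one way to get a single vector is to run just the encoder and mean-pool its hidden states. A caveat: LED wasn't trained as an embedding model, so treat this as a rough approach rather than something tuned for similarity:

```python
import torch
from transformers import AutoTokenizer, LEDModel

tokenizer = AutoTokenizer.from_pretrained("allenai/led-large-16384")
model = LEDModel.from_pretrained("allenai/led-large-16384")

def embed_document(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt",
                       truncation=True, max_length=16384)
    # LED uses local attention plus global attention on selected tokens;
    # giving the first token global attention is the usual default.
    global_attention_mask = torch.zeros_like(inputs["input_ids"])
    global_attention_mask[:, 0] = 1
    with torch.no_grad():
        encoder_outputs = model.get_encoder()(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            global_attention_mask=global_attention_mask,
        )
    # Mean-pool the encoder hidden states into one document vector.
    return encoder_outputs.last_hidden_state.mean(dim=1).squeeze(0)
```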