Document similarity of long documents, e.g. legal contracts

Is there any way of getting similarity between very long text documents? I know about the ways to get similarity between sentences using sentence transformers, but is there a model that can give me a one-shot output of similar or not? Something like a Siamese network that can tell whether two random images are similar. I might be wrong about the analogy, but it seems very similar.
If such models don't exist, is there a method where I can make use of transformers to get similarity between long documents?

Hi @hemangr8, a very simple thing you can try is:

  1. Split the document into passages or sentences
  2. Embed each passage / sentence as a vector
  3. Take the average of the vectors to get a single vector representation of the document
  4. Compare documents using your favourite similarity metric (e.g. cosine similarity); a rough sketch of this recipe follows below
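
Something like this, as a rough sketch assuming a sentence-transformers model (the model name and the naive blank-line split are just placeholders):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model

def embed_document(text):
    # 1-2. Split into passages (naively, on blank lines) and embed each one
    passages = [p.strip() for p in text.split("\n\n") if p.strip()]
    embeddings = model.encode(passages, convert_to_tensor=True)
    # 3. Average the passage vectors into a single document vector
    return embeddings.mean(dim=0)

# 4. Compare two documents with cosine similarity
doc_a = embed_document(open("contract_a.txt").read())
doc_b = embed_document(open("contract_b.txt").read())
print(util.cos_sim(doc_a, doc_b).item())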

Depending on the length of your documents, you could also try using the Longformer Encoder-Decoder (LED), which has a context size of 16K tokens: allenai/led-large-16384 · Hugging Face

If your documents fit within the 16K limit, you could embed them in one go. There are some related ideas in this thread as well: Summarization on long documents
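
If you go the single-pass route, a rough sketch with the LED encoder could look like the following (mean pooling over the encoder states is just one reasonable choice, not an official recipe):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/led-large-16384")
model = AutoModel.from_pretrained("allenai/led-large-16384")

def embed_long_document(text):
    # Tokenize up to the 16K-token limit
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=16384)
    # LED expects global attention on at least one token (here: the first)
    global_attention_mask = torch.zeros_like(inputs["input_ids"])
    global_attention_mask[:, 0] = 1
    with torch.no_grad():
        encoder_outputs = model.get_encoder()(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            global_attention_mask=global_attention_mask,
        )
    # Mean-pool the encoder hidden states into one document vector
    return encoder_outputs.last_hidden_state.mean(dim=1).squeeze(0)

emb_a = embed_long_document(open("contract_a.txt").read())
emb_b = embed_long_document(open("contract_b.txt").read())
print(torch.nn.functional.cosine_similarity(emb_a, emb_b, dim=0).item())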

Hi @hemangr8,

I am working on a similar problem, so if you tried some of the suggested solutions, I am curious to know which worked best: averaging over the vectors or using the Longformer Encoder-Decoder?

I’m also working on a similar problem and would be interested in hearing about your progress, @hemangr8 and @maximilienroberti.

Hey, I’m working on a similar problem as well. Can you share your methodologies, @jaxonkeeler?

You can use FAISS, ChromaDB, Pinecone, Qdrant, Elasticsearch, MongoDB Vector Search, etc.

These are a few vector stores and search libraries that can be used for text similarity.
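
As a small sketch of that idea with FAISS (the embedding dimension and the random vectors are placeholders; in practice the vectors come from your embedding model):

import faiss
import numpy as np

# Placeholder embeddings; replace with real document/section embeddings
doc_embeddings = np.random.rand(100, 384).astype("float32")

# Normalize so that inner product equals cosine similarity
faiss.normalize_L2(doc_embeddings)
index = faiss.IndexFlatIP(doc_embeddings.shape[1])
index.add(doc_embeddings)

# Find the 5 documents most similar to the first one
scores, ids = index.search(doc_embeddings[:1], 5)
print(ids[0], scores[0])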

My solution:

  • Convert the document to Markdown with unstructured or by section tags
  • Split into sections, recursively if needed
  • Embed each section and compare the vectors (see the sketch after this list)
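
A minimal sketch of the section-level comparison, assuming a sentence-transformers model (the model name and the heading-based splitter are placeholders):

import re
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model

def split_into_sections(markdown_text):
    # Simple placeholder: split the Markdown on headings
    return [s.strip() for s in re.split(r"\n(?=#+ )", markdown_text) if s.strip()]

def compare_documents(md1, md2):
    sections1, sections2 = split_into_sections(md1), split_into_sections(md2)
    emb1 = model.encode(sections1, convert_to_tensor=True)
    emb2 = model.encode(sections2, convert_to_tensor=True)
    # Cosine similarity between every pair of sections
    sims = util.cos_sim(emb1, emb2)
    # For each section of document 1, report its best match in document 2
    best_scores, best_ids = sims.max(dim=1)
    return list(zip(sections1, best_ids.tolist(), best_scores.tolist()))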

If the documents are very similar, you can use difflib or fuzzy matching, like this:


import difflib
import re

from rapidfuzz import fuzz  # fuzzywuzzy works too; both provide fuzz.ratio


def split_into_sections(text):
    # Simple placeholder: split Markdown text on headings; swap in your own sectioner
    return [s.strip() for s in re.split(r"\n(?=#+ )", text) if s.strip()]


def fuzzy_diff(text1, text2, section_threshold=0.8, content_threshold=0.8):
    sections1 = split_into_sections(text1)
    sections2 = split_into_sections(text2)

    diff_result = []

    for section1 in sections1:
        # Find the closest section in the other document (fuzz.ratio is 0-100)
        best_match = max(sections2, key=lambda s: fuzz.ratio(section1, s))
        section_similarity = fuzz.ratio(section1, best_match) / 100.0

        if section_similarity < section_threshold:
            # No sufficiently similar section in text2: the whole section was removed
            diff_result.append({"type": "removed", "content": section1})
        else:
            # Sections roughly match: diff them line by line
            lines1 = section1.split('\n')
            lines2 = best_match.split('\n')

            differ = difflib.SequenceMatcher(None, lines1, lines2)
            for tag, i1, i2, j1, j2 in differ.get_opcodes():
                if tag == 'replace':
                    for line in lines1[i1:i2]:
                        best_line_match = max(lines2[j1:j2], key=lambda l: fuzz.ratio(line, l))
                        line_similarity = fuzz.ratio(line, best_line_match) / 100.0
                        if line_similarity < content_threshold:
                            diff_result.append({"type": "removed", "content": line})
                            diff_result.append({"type": "added", "content": best_line_match})
                elif tag == 'delete':
                    for line in lines1[i1:i2]:
                        diff_result.append({"type": "removed", "content": line})
                elif tag == 'insert':
                    for line in lines2[j1:j2]:
                        diff_result.append({"type": "added", "content": line})

    # Sections of text2 with no close counterpart in text1 count as added
    for section2 in sections2:
        if max(fuzz.ratio(section2, s) for s in sections1) / 100.0 < section_threshold:
            diff_result.append({"type": "added", "content": section2})
    return diff_result
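
A quick usage sketch, assuming md1 and md2 already hold the two documents as Markdown strings:

changes = fuzzy_diff(md1, md2)
for change in changes:
    print(change["type"], "->", change["content"][:80])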