Document similarity of long documents, e.g. legal contracts

Is there any way of getting similarity between very long text documents? I know about the ways to get similarity between sentences using sentence transformers, but is there a model that can give me a one-shot output of similar or not? Something like a Siamese network that can tell whether two random images are similar. I might be wrong about the analogy, but it seems very similar.
If such models don't exist, is there a method where I can make use of transformers to get similarity between long documents?

Hi @hemangr8, a very simple thing you can try is:

  1. Split the document into passages or sentences
  2. Embed each passage / sentence as a vector
  3. Take the average of the vectors to get a single vector representation of the document
  4. Compare documents using your favourite similarity metric (e.g. cosine similarity); a rough sketch of this recipe follows below
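
Something like this, as a rough sketch assuming a sentence-transformers model (the model name and the naive blank-line split are just placeholders):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model

def embed_document(text):
    # 1-2. Split into passages (naively, on blank lines) and embed each one
    passages = [p.strip() for p in text.split("\n\n") if p.strip()]
    embeddings = model.encode(passages, convert_to_tensor=True)
    # 3. Average the passage vectors into a single document vector
    return embeddings.mean(dim=0)

# 4. Compare two documents with cosine similarity
doc_a = embed_document(open("contract_a.txt").read())
doc_b = embed_document(open("contract_b.txt").read())
print(util.cos_sim(doc_a, doc_b).item())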

Depending on the length of your documents, you could also try using the Longformer Encoder-Decoder (LED), which has a context size of 16K tokens: allenai/led-large-16384 · Hugging Face

If your documents fit within the 16K limit, you could embed them in one go. There are some related ideas in this thread as well: Summarization on long documents
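
If you go the single-pass route, a rough sketch with the LED encoder could look like the following (mean pooling over the encoder states is just one reasonable choice, not an official recipe):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/led-large-16384")
model = AutoModel.from_pretrained("allenai/led-large-16384")

def embed_long_document(text):
    # Tokenize up to the 16K-token limit
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=16384)
    # LED expects global attention on at least one token (here: the first)
    global_attention_mask = torch.zeros_like(inputs["input_ids"])
    global_attention_mask[:, 0] = 1
    with torch.no_grad():
        encoder_outputs = model.get_encoder()(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            global_attention_mask=global_attention_mask,
        )
    # Mean-pool the encoder hidden states into one document vector
    return encoder_outputs.last_hidden_state.mean(dim=1).squeeze(0)

emb_a = embed_long_document(open("contract_a.txt").read())
emb_b = embed_long_document(open("contract_b.txt").read())
print(torch.nn.functional.cosine_similarity(emb_a, emb_b, dim=0).item())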

Hi @hemangr8,

I am working on a similar problem, so if you tried some of the suggested solutions, I am curious to know which worked best: averaging over the vectors or using the Longformer Encoder-Decoder?

I’m also working on a similar problem and would be interested in hearing about your progress, @hemangr8 and @maximilienroberti.

Hey, I’m working on a similar problem as well. Can you share your methodologies, @jaxonkeeler?

You can use FAISS, ChromaDB, Pinecone, Qdrant, Elasticsearch, MongoDB Vector Search, etc.

These are a few vector stores and search libraries that can be used for text similarity.
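
As a small sketch of that idea with FAISS (the embedding dimension and the random vectors are placeholders; in practice the vectors come from your embedding model):

import faiss
import numpy as np

# Placeholder embeddings; replace with real document/section embeddings
doc_embeddings = np.random.rand(100, 384).astype("float32")

# Normalize so that inner product equals cosine similarity
faiss.normalize_L2(doc_embeddings)
index = faiss.IndexFlatIP(doc_embeddings.shape[1])
index.add(doc_embeddings)

# Find the 5 documents most similar to the first one
scores, ids = index.search(doc_embeddings[:1], 5)
print(ids[0], scores[0])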

My solution:

  • Convert the document to Markdown with unstructured or by section tags
  • Split into sections, recursively if needed
  • Embed each section and compare the vectors (see the sketch after this list)
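
A minimal sketch of the section-level comparison, assuming a sentence-transformers model (the model name and the heading-based splitter are placeholders):

import re
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model

def split_into_sections(markdown_text):
    # Simple placeholder: split the Markdown on headings
    return [s.strip() for s in re.split(r"\n(?=#+ )", markdown_text) if s.strip()]

def compare_documents(md1, md2):
    sections1, sections2 = split_into_sections(md1), split_into_sections(md2)
    emb1 = model.encode(sections1, convert_to_tensor=True)
    emb2 = model.encode(sections2, convert_to_tensor=True)
    # Cosine similarity between every pair of sections
    sims = util.cos_sim(emb1, emb2)
    # For each section of document 1, report its best match in document 2
    best_scores, best_ids = sims.max(dim=1)
    return list(zip(sections1, best_ids.tolist(), best_scores.tolist()))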

If the documents are very similar, you can use difflib or fuzzy matching, like this:


import difflib
import re

from rapidfuzz import fuzz  # fuzzywuzzy works too; both provide fuzz.ratio


def split_into_sections(text):
    # Simple placeholder: split Markdown text on headings; swap in your own sectioner
    return [s.strip() for s in re.split(r"\n(?=#+ )", text) if s.strip()]


def fuzzy_diff(text1, text2, section_threshold=0.8, content_threshold=0.8):
    sections1 = split_into_sections(text1)
    sections2 = split_into_sections(text2)

    diff_result = []

    for section1 in sections1:
        # Find the closest section in the other document (fuzz.ratio is 0-100)
        best_match = max(sections2, key=lambda s: fuzz.ratio(section1, s))
        section_similarity = fuzz.ratio(section1, best_match) / 100.0

        if section_similarity < section_threshold:
            # No sufficiently similar section in text2: the whole section was removed
            diff_result.append({"type": "removed", "content": section1})
        else:
            # Sections roughly match: diff them line by line
            lines1 = section1.split('\n')
            lines2 = best_match.split('\n')

            differ = difflib.SequenceMatcher(None, lines1, lines2)
            for tag, i1, i2, j1, j2 in differ.get_opcodes():
                if tag == 'replace':
                    for line in lines1[i1:i2]:
                        best_line_match = max(lines2[j1:j2], key=lambda l: fuzz.ratio(line, l))
                        line_similarity = fuzz.ratio(line, best_line_match) / 100.0
                        if line_similarity < content_threshold:
                            diff_result.append({"type": "removed", "content": line})
                            diff_result.append({"type": "added", "content": best_line_match})
                elif tag == 'delete':
                    for line in lines1[i1:i2]:
                        diff_result.append({"type": "removed", "content": line})
                elif tag == 'insert':
                    for line in lines2[j1:j2]:
                        diff_result.append({"type": "added", "content": line})

    # Sections of text2 with no close counterpart in text1 count as added
    for section2 in sections2:
        if max(fuzz.ratio(section2, s) for s in sections1) / 100.0 < section_threshold:
            diff_result.append({"type": "added", "content": section2})
    return diff_result
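
A quick usage sketch, assuming md1 and md2 already hold the two documents as Markdown strings:

changes = fuzzy_diff(md1, md2)
for change in changes:
    print(change["type"], "->", change["content"][:80])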