Is there any way of getting similarity between very long text documents? I know about the ways to get similarity between sentences using sentence transformers, but is there a model that can give a one-shot similar/not-similar output for whole documents? Something like a Siamese network that can tell whether two random images are similar or not. I might be wrong about the analogy, but it seems very similar.
If such models don’t exist, is there a method where I can use transformers to get similarities between long documents?
Hi @hemangr8, a very simple thing you can try is:
- Split the document into passages or sentences
- Embed each passage / sentence as a vector
- Take the average of the vectors to get a single vector representation of the document
- Compare documents using your favourite similarity metric (e.g. cosine similarity); a short sketch follows below
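A minimal sketch of those steps, assuming sentence-transformers is installed; the model name "all-MiniLM-L6-v2", the naive paragraph split, and the file names are just placeholder choices:

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def document_vector(document):
    # Naive passage split on blank lines; swap in your own sentence/passage splitter
    passages = [p for p in document.split("\n\n") if p.strip()]
    embeddings = model.encode(passages)  # one vector per passage
    return embeddings.mean(axis=0)       # average -> single document vector

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

doc_vec_1 = document_vector(open("doc1.txt").read())
doc_vec_2 = document_vector(open("doc2.txt").read())
print(cosine_similarity(doc_vec_1, doc_vec_2))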
Depending on the length of your documents, you could also try using the Longformer Encoder-Decoder, which has a context size of 16K tokens: allenai/led-large-16384 · Hugging Face
If your documents fit within the 16K limit, you could embed them in one go. There are also some related ideas in this thread: Summarization on long documents
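If you go the LED route, one rough way to get a single document vector is to mean-pool the encoder's hidden states. Note that LED is not trained to produce similarity embeddings, so this is only a crude sketch:

import torch
from transformers import AutoTokenizer, LEDModel

tokenizer = AutoTokenizer.from_pretrained("allenai/led-large-16384")
model = LEDModel.from_pretrained("allenai/led-large-16384")

def led_document_vector(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=16384)
    with torch.no_grad():
        encoder_out = model.get_encoder()(**inputs)
    hidden = encoder_out.last_hidden_state         # (1, seq_len, hidden_size)
    mask = inputs["attention_mask"].unsqueeze(-1)  # ignore padding positions
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)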
Hi @hemangr8,
I am working on a similar problem, so if you tried some of the suggested solutions, I am curious to know which worked best: averaging the vectors or using the Longformer Encoder-Decoder?
I’m also working on a similar problem and would be interested in hearing about your progress @hemangr8 and @maximilienroberti.
Hey, I’m working on a similar problem as well. Can you share your methodologies? @jaxonkeeler
You can use FAISS, ChromaDB, Pinecone, Qdrant, Elasticsearch, MongoDB Vector Search, etc.
These are a few vector databases that can be used for text similarity search.
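These all index embedding vectors rather than raw text, so you still need an embedding model on top. A minimal sketch with FAISS (faiss-cpu) and sentence-transformers; the model name and documents are placeholders:

import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
docs = ["first document ...", "second document ...", "third document ..."]

# Normalized embeddings + inner-product index == cosine similarity search
embeddings = model.encode(docs, normalize_embeddings=True).astype("float32")
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

query = model.encode(["query document ..."], normalize_embeddings=True).astype("float32")
scores, ids = index.search(query, 2)  # top-2 most similar documents
print(ids[0], scores[0])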
My solution:
- Convert the document to Markdown with unstructured or by using section tags
- Split it into sections, recursively if needed
- Embed each section and compare vectors (see the sketch after this list)
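A hedged sketch of that last step, assuming a split_into_sections helper (like the one used in the snippet further below) already exists and an example embedding model:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def section_similarity_matrix(doc1, doc2):
    sections1 = split_into_sections(doc1)  # assumed helper, e.g. split on Markdown headings
    sections2 = split_into_sections(doc2)
    emb1 = model.encode(sections1, convert_to_tensor=True)
    emb2 = model.encode(sections2, convert_to_tensor=True)
    # rows = sections of doc1, columns = sections of doc2
    return util.cos_sim(emb1, emb2)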
If the documents are very similar, you can use difflib or fuzzy matching like this:
import difflib

from rapidfuzz import fuzz  # or: from fuzzywuzzy import fuzz

def fuzzy_diff(text1, text2, section_threshold=0.8, content_threshold=0.8):
    # split_into_sections is assumed to be defined elsewhere (e.g. split on Markdown headings)
    sections1 = split_into_sections(text1)
    sections2 = split_into_sections(text2)
    diff_result = []
    for section1 in sections1:
        # Find the closest section in the other document
        best_match = max(sections2, key=lambda s: fuzz.ratio(section1, s))
        section_similarity = fuzz.ratio(section1, best_match) / 100.0
        if section_similarity < section_threshold:
            # No sufficiently similar section in text2 -> whole section was removed
            diff_result.append({"type": "removed", "content": section1})
        else:
            # Similar section found: diff it line by line
            lines1 = section1.split('\n')
            lines2 = best_match.split('\n')
            differ = difflib.SequenceMatcher(None, lines1, lines2)
            for tag, i1, i2, j1, j2 in differ.get_opcodes():
                if tag == 'replace':
                    for line in lines1[i1:i2]:
                        best_line_match = max(lines2[j1:j2], key=lambda l: fuzz.ratio(line, l))
                        line_similarity = fuzz.ratio(line, best_line_match) / 100.0
                        if line_similarity < content_threshold:
                            diff_result.append({"type": "removed", "content": line})
                            diff_result.append({"type": "added", "content": best_line_match})
                elif tag == 'delete':
                    for line in lines1[i1:i2]:
                        diff_result.append({"type": "removed", "content": line})
                elif tag == 'insert':
                    for line in lines2[j1:j2]:
                        diff_result.append({"type": "added", "content": line})
    # Sections of text2 with no close match in text1 were added
    for section2 in sections2:
        if max(fuzz.ratio(section2, s) for s in sections1) / 100.0 < section_threshold:
            diff_result.append({"type": "added", "content": section2})
    return diff_result
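Usage is then just the following, where text1 and text2 are the raw or Markdown-converted documents:

changes = fuzzy_diff(text1, text2)
for change in changes:
    print(change["type"], change["content"][:80])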