How to find similarity in documents longer than input sequence length?

bxff · August 20, 2022, 10:03pm

I am fairly new to ML/AI, so I apologise beforehand if I have misunderstood things.

You cannot increase the length higher than what is maximally supported by the respective transformer model – Computing Sentence Embeddings — Sentence-Transformers documentation (sbert.net)

I am trying to find similarities between two documents provided by users, which don’t fit the sequence limit on most SBERT models of around 200-300 words. What should I do to find similarities between them? I couldn’t find any information on this, other than simply to truncate the input.

mmalandro · August 21, 2022, 4:13pm

I’m not familiar with SBERT, but I will say that handling long documents (i.e., documents that encode to > 512 tokens) is a significant challenge and a major area of current research. You might want to check out long-document transformers like BigBirdPegasus or Longformer.

mvonwyl · August 25, 2022, 4:59pm

There is a pretty long thread on a similar topic (long text classification) on Stack Overflow. Perhaps you can adapt some of their solutions to your problem.

For example, on the IMDB dataset, removing the middle part of the review works well, as most of the opinion appears at the beginning and the end of a review.
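
To make the "remove the middle part" idea concrete, here is a minimal sketch in plain Python. The function name `head_tail_truncate` and its parameters `max_words` and `head_fraction` are made up for illustration and are not part of any library. It counts whitespace-separated words, which is only a rough stand-in for the model's token limit, so you would probably want to leave some margin.

```python
def head_tail_truncate(text, max_words=256, head_fraction=0.5):
    """Keep the beginning and end of `text` and drop the middle.

    Hypothetical helper: at most `max_words` whitespace-separated words
    are kept. Words are only an approximation of model tokens.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0.0 <= head_fraction <= 1.0:
        raise ValueError("head_fraction must be between 0 and 1")

    words = text.split()
    if len(words) <= max_words:
        # Short enough already: return the text unchanged.
        return text

    n_head = int(max_words * head_fraction)  # words kept from the start
    n_tail = max_words - n_head              # words kept from the end
    head = words[:n_head]
    # Guard n_tail == 0, because words[-0:] would return the whole list.
    tail = words[-n_tail:] if n_tail > 0 else []
    return " ".join(head + tail)


# Assumed usage with sentence-transformers (not executed here):
#   from sentence_transformers import SentenceTransformer, util
#   model = SentenceTransformer("all-MiniLM-L6-v2")
#   emb = model.encode([head_tail_truncate(doc_a), head_tail_truncate(doc_b)])
#   score = util.cos_sim(emb[0], emb[1])
```

Whether keeping the head and tail suits your documents depends on where the important content sits. It reportedly works for reviews, but it may not carry over to other kinds of text.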
