How to find similarity in documents longer than input sequence length?

I am fairly new to ML/AI, so I apologise beforehand if I have misunderstood things.

You cannot increase the length higher than what is maximally supported by the respective transformer model – Computing Sentence Embeddings — Sentence-Transformers documentation (sbert.net)

I am trying to find similarities between two documents provided by users, which exceed the sequence limit of most SBERT models (around 200-300 words). What should I do to find similarities between them? I couldn’t find any information on this, other than to simply truncate the input.


I’m not familiar with SBERT, but I will say that handling long documents (i.e., documents that encode to > 512 tokens) is a significant challenge and a major area of current research. You might want to check out long-document transformers like BigBirdPegasus or Longformer.


There is a pretty long thread on a similar topic (long text classification) on Stack Overflow. Perhaps you can adapt some of their solutions to your problem.

For example, on the IMDB dataset, removing the middle part of the review works well, as most of the opinion appears at the beginning and the end of a review.
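
In case it helps, here is a minimal sketch of that "drop the middle" idea. The function name head_tail_truncate and the max_len / head_len parameters are made up for illustration, and the whitespace split is only a stand-in for whatever tokenizer your model uses, so treat it as a starting point rather than a tested recipe.

```python
def head_tail_truncate(tokens, max_len, head_len):
    """Keep the first head_len tokens and the last (max_len - head_len) tokens.

    Hypothetical helper: drops the middle of a sequence that is too long.
    """
    if max_len <= 0 or not 0 <= head_len <= max_len:
        raise ValueError("need max_len > 0 and 0 <= head_len <= max_len")
    if len(tokens) <= max_len:
        # Already short enough, nothing to remove.
        return list(tokens)
    tail_len = max_len - head_len
    head = list(tokens[:head_len])
    # Guard against tail_len == 0, because tokens[-0:] would return everything.
    tail = list(tokens[-tail_len:]) if tail_len > 0 else []
    return head + tail


# Assumption: a plain whitespace split stands in for the model's tokenizer.
words = "the opening was great then a long plot summary follows but overall I loved it".split()
shortened = head_tail_truncate(words, max_len=8, head_len=4)
print(" ".join(shortened))  # the opening was great overall I loved it
```

You would then embed the shortened text of each document and compare the two embeddings as usual. How to split the budget between the head and the tail is something to tune on your own data; it worked for IMDB reviews, but it may not suit documents where the key content sits in the middle.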