Hi @hemangr8, a very simple thing you can try is:
- Split the document into passages or sentences
- Embed each passage / sentence as a vector
- Take the average of the vectors to get a single vector representation of the document
- Compare documents using your favourite similarity metric (e.g. cosine similarity) — there's a minimal sketch of the whole pipeline just below this list
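To make that concrete, here's a rough sketch using sentence-transformers. The model name (`all-MiniLM-L6-v2`) and the naive full-stop sentence splitting are just placeholder choices — swap in whatever embedding model and splitter suit your data:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Placeholder model choice; any sentence-embedding model works here.
model = SentenceTransformer("all-MiniLM-L6-v2")

def doc_vector(text: str) -> np.ndarray:
    # Naive sentence splitting -- use nltk or spacy for anything serious.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    embeddings = model.encode(sentences)  # shape: (n_sentences, dim)
    return embeddings.mean(axis=0)        # average into one document vector

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

doc_a = doc_vector("First document. It talks about cats.")
doc_b = doc_vector("Second document. It discusses dogs.")
print(cosine(doc_a, doc_b))
```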
Depending on the length of your documents, you could also try the Longformer Encoder-Decoder (LED), which has a context size of 16K tokens: allenai/led-large-16384 · Hugging Face
If your documents fit within the 16K limit, you could embed each one in a single pass (see the sketch below). There are some related ideas in this thread as well: Summarization on long documents
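For the LED option, one way to get a single vector is to run just the encoder and mean-pool its hidden states. A caveat: LED wasn't trained as an embedding model, so treat this as a rough approach rather than something tuned for similarity:

```python
import torch
from transformers import AutoTokenizer, LEDModel

tokenizer = AutoTokenizer.from_pretrained("allenai/led-large-16384")
model = LEDModel.from_pretrained("allenai/led-large-16384")

def embed_document(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt",
                       truncation=True, max_length=16384)
    # LED uses local attention plus global attention on selected tokens;
    # giving the first token global attention is the usual default.
    global_attention_mask = torch.zeros_like(inputs["input_ids"])
    global_attention_mask[:, 0] = 1
    with torch.no_grad():
        encoder_outputs = model.get_encoder()(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            global_attention_mask=global_attention_mask,
        )
    # Mean-pool the encoder hidden states into one document vector.
    return encoder_outputs.last_hidden_state.mean(dim=1).squeeze(0)
```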