When embedding documents with more than 4096 tokens, does Longformer automatically create chunks of 4096 tokens (until all of the document's tokens are covered), with each chunk becoming its own tensor?
Either that or it just truncates — cutting from the front, from the back, or keeping only the middle. Curious to know the answer too.
I have found that with sentence-transformers, the model's max length determines how much of your document is used: it simply takes the first tokens. For example, if your model has a 512-token max, a 1000-token document is represented by only its first 512 tokens. Maybe Longformer works the same way and takes only the first 4096 tokens of a document? I have not tested this myself, but you could check: run Longformer on a long document, then run it on a text file containing only the first 4096 tokens of that document. Edit the text file to be shorter or longer than 4096 tokens and see whether the cosine similarity between the two embeddings changes.
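Here is a toy sketch of that experiment's logic, with no real model involved — the bag-of-words "embedding" and the `MAX_LEN` constant are stand-ins I made up to show why, under the truncation hypothesis, a full document and its truncated prefix would score a cosine similarity of exactly 1.0:

```python
# Toy illustration of the truncation hypothesis (NOT a real model):
# if the encoder only ever sees the first MAX_LEN tokens, then a
# 1000-token document and its first-512-token prefix embed identically.
from collections import Counter
import math

MAX_LEN = 512  # stand-in for the model's max sequence length (assumed)

def embed(tokens, max_len=MAX_LEN):
    """Stand-in 'embedding': bag-of-words counts over the truncated prefix."""
    return Counter(tokens[:max_len])

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

full_doc = [f"tok{i}" for i in range(1000)]  # 1000-token document
prefix = full_doc[:512]                      # first 512 tokens only

# Both inputs are truncated to the same 512 tokens, so similarity is 1.0.
print(cosine(embed(full_doc), embed(prefix)))        # 1.0
# A shorter prefix (256 tokens) differs from the truncated full doc.
print(cosine(embed(full_doc), embed(full_doc[:256])) < 1.0)  # True
```

If the real model behaves the same way, editing the prefix file to stay under 4096 tokens should leave the similarity at 1.0, while crossing the boundary should make it drop.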