When embedding documents with more than 4096 tokens, does Longformer automatically create chunks of 4096 tokens (until all of the document's tokens are covered), with each chunk becoming its own tensor?
Either that or it just truncates — cutting from the front, from the back, or keeping only the middle. Curious to know the answer too.
I have found that with sentence-transformers, the model's max length determines how much of your document is used: it simply takes the first tokens. For example, if your model has a 512-token max, a 1000-token document is represented by only its first 512 tokens. Maybe Longformer works the same way and takes only the first 4096 tokens of a document? I have not tested this myself, but you could check: run Longformer on a long document, then run it on a text file containing only the first 4096 tokens of that document. Edit the text file to be shorter or longer than 4096 tokens and see whether the cosine similarity between the two embeddings changes.
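Here is a toy sketch of that experiment's logic, with no real model involved — the bag-of-words "embedding" and the `MAX_LEN` constant are stand-ins I made up to show why, under the truncation hypothesis, a full document and its truncated prefix would score a cosine similarity of exactly 1.0:

```python
# Toy illustration of the truncation hypothesis (NOT a real model):
# if the encoder only ever sees the first MAX_LEN tokens, then a
# 1000-token document and its first-512-token prefix embed identically.
from collections import Counter
import math

MAX_LEN = 512  # stand-in for the model's max sequence length (assumed)

def embed(tokens, max_len=MAX_LEN):
    """Stand-in 'embedding': bag-of-words counts over the truncated prefix."""
    return Counter(tokens[:max_len])

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

full_doc = [f"tok{i}" for i in range(1000)]  # 1000-token document
prefix = full_doc[:512]                      # first 512 tokens only

# Both inputs are truncated to the same 512 tokens, so similarity is 1.0.
print(cosine(embed(full_doc), embed(prefix)))        # 1.0
# A shorter prefix (256 tokens) differs from the truncated full doc.
print(cosine(embed(full_doc), embed(full_doc[:256])) < 1.0)  # True
```

If the real model behaves the same way, editing the prefix file to stay under 4096 tokens should leave the similarity at 1.0, while crossing the boundary should make it drop.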