It is my understanding that all the pretrained models have a fixed maximum number of tokens (512 for
bert-base-uncased). Suppose I have texts that, when tokenized, exceed that number (like fictional text running through many paragraphs). I feel there should be a better way than just using the first 512 tokens of the text. I could increase that limit, but my understanding is that to do so I would have to train the model from scratch, losing the benefit of the pretrained weights. I would like to keep using the pretrained model.
In order to achieve this I have an idea and need some feedback on that:
- Split the text into a list of sentences using a Sentence Boundary Disambiguation (SBD) tool.
- Tokenize each sentence using the model’s corresponding tokenizer.
- Create our new text by keeping the first and last *n* sentences from the list, then taking a random subset of the remaining sentences such that the total token count adds up to at most 512.
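The steps above could be sketched roughly as follows. This is a minimal illustration only: `split_sentences` is a naive regex stand-in for a real SBD tool (e.g. spaCy or NLTK), and `count_tokens` is a whitespace-split stand-in for the model's actual tokenizer (with a real tokenizer you would use something like `len(tokenizer(s, add_special_tokens=False)["input_ids"])` instead). The function name `sample_sentences` and the parameter names are my own, not from any library.

```python
import random
import re

def split_sentences(text):
    # Naive stand-in for a Sentence Boundary Disambiguation tool;
    # splits on sentence-ending punctuation followed by whitespace.
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def count_tokens(sentence):
    # Stand-in for the pretrained model's tokenizer; a real
    # implementation would count subword tokens, not words.
    return len(sentence.split())

def sample_sentences(text, n=2, max_tokens=512, seed=0):
    sents = split_sentences(text)
    if len(sents) <= 2 * n:
        return sents
    head, middle, tail = sents[:n], sents[n:-n], sents[-n:]
    # Token budget left for the middle after keeping head and tail.
    budget = max_tokens - sum(count_tokens(s) for s in head + tail)
    # Visit middle sentences in random order, keeping those that
    # still fit in the budget; restore document order at the end.
    rng = random.Random(seed)
    order = list(range(len(middle)))
    rng.shuffle(order)
    chosen = set()
    for i in order:
        cost = count_tokens(middle[i])
        if cost <= budget:
            chosen.add(i)
            budget -= cost
    kept_middle = [middle[i] for i in sorted(chosen)]
    return head + kept_middle + tail
```

One detail worth deciding up front: whether the sampled middle sentences keep their original order (as above) or are concatenated in sampled order; keeping document order seems safer since BERT's positional encoding will otherwise see incoherent transitions.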
This way the input is not restricted to only the first 512 tokens, and random sentences from the middle of the text are included as well. Any thoughts on this approach?