Question about maximum number of tokens


It is my understanding that all the pretrained models have a fixed maximum input length (512 tokens for bert-base-uncased). Suppose I have texts that, when tokenized, exceed that limit (like fictional text running through many paragraphs). I feel there should be a better way than just using the first 512 tokens of the text. I could increase that limit, but my understanding is that doing so would require training a model from scratch, so I could not use the pretrained weights. I would like to use the pretrained model.

To achieve this, I have an idea and would like some feedback on it:

  1. Split the text into a list of sentences using a Sentence Boundary Disambiguation tool.
  2. Tokenize each sentence using the model’s corresponding tokenizer.
  3. Create the new text by keeping the first and last n sentences from the list and then adding a random subset of the remaining sentences, such that the total token count does not exceed 512.
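
The steps above can be sketched roughly as follows. This is a minimal illustration, not a tested recipe: `tokenize` stands in for the model's real tokenizer (e.g. a `bert-base-uncased` tokenizer's `tokenize` method), the sentence list is assumed to come from your SBD tool, and the function name and parameters are made up for this example:

```python
import random

def sample_sentences(sentences, tokenize, max_tokens=512, n_edge=2, seed=0):
    """Keep the first and last n_edge sentences, then fill the remaining
    token budget with a random subset of the middle sentences, preserving
    their original order in the output."""
    lengths = [len(tokenize(s)) for s in sentences]

    # Always keep the first and last n_edge sentences.
    head = range(min(n_edge, len(sentences)))
    tail = range(max(len(sentences) - n_edge, n_edge), len(sentences))
    keep = set(head) | set(tail)

    # Remaining token budget after the mandatory sentences.
    budget = max_tokens - sum(lengths[i] for i in keep)

    # Randomly add middle sentences that still fit in the budget.
    middle = [i for i in range(len(sentences)) if i not in keep]
    rng = random.Random(seed)
    rng.shuffle(middle)
    for i in middle:
        if lengths[i] <= budget:
            keep.add(i)
            budget -= lengths[i]

    return [sentences[i] for i in sorted(keep)]
```

Note that with a real BERT tokenizer you would also need to reserve a couple of tokens for the special `[CLS]` and `[SEP]` markers, and a greedy fill like this may slightly undershoot the budget.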

This way the input is not restricted to only the first 512 tokens, and it also includes random sentences from the middle of the text. Any thoughts on this approach?

Sure, that is an option. You could also first run the text through a summarization model and use its output as the input to your classification model. There is no one “right” approach. You can try different things and see what works best for you.