It is my understanding that all the pretrained models have a fixed maximum number of tokens (512 for
bert-base-uncased). Suppose I have texts that, when tokenized, exceed that number (like fictional text running through many paragraphs). I feel there should be a better way than just using the first 512 tokens of the text. I could increase that limit, but my understanding is that to do so I would have to train the model from scratch, losing the benefit of the pretrained weights. I would like to keep using the pretrained model.
In order to achieve this I have an idea and need some feedback on that:
- Split the text into a list of sentences using a Sentence Boundary Disambiguation (SBD) tool.
- Tokenize each sentence using the model’s corresponding tokenizer.
- Create our new text by keeping the first and last *n* sentences from the list, then taking a random subset of the remaining sentences such that the total token count adds up to at most 512.
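The steps above could be sketched roughly as follows. This is a minimal illustration only: `split_sentences` is a naive regex stand-in for a real SBD tool (e.g. spaCy or NLTK), and `count_tokens` is a whitespace-split stand-in for the model's actual tokenizer (with a real tokenizer you would use something like `len(tokenizer(s, add_special_tokens=False)["input_ids"])` instead). The function name `sample_sentences` and the parameter names are my own, not from any library.

```python
import random
import re

def split_sentences(text):
    # Naive stand-in for a Sentence Boundary Disambiguation tool;
    # splits on sentence-ending punctuation followed by whitespace.
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def count_tokens(sentence):
    # Stand-in for the pretrained model's tokenizer; a real
    # implementation would count subword tokens, not words.
    return len(sentence.split())

def sample_sentences(text, n=2, max_tokens=512, seed=0):
    sents = split_sentences(text)
    if len(sents) <= 2 * n:
        return sents
    head, middle, tail = sents[:n], sents[n:-n], sents[-n:]
    # Token budget left for the middle after keeping head and tail.
    budget = max_tokens - sum(count_tokens(s) for s in head + tail)
    # Visit middle sentences in random order, keeping those that
    # still fit in the budget; restore document order at the end.
    rng = random.Random(seed)
    order = list(range(len(middle)))
    rng.shuffle(order)
    chosen = set()
    for i in order:
        cost = count_tokens(middle[i])
        if cost <= budget:
            chosen.add(i)
            budget -= cost
    kept_middle = [middle[i] for i in sorted(chosen)]
    return head + kept_middle + tail
```

One detail worth deciding up front: whether the sampled middle sentences keep their original order (as above) or are concatenated in sampled order; keeping document order seems safer since BERT's positional encoding will otherwise see incoherent transitions.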
This way the input is not restricted to only the first 512 tokens, and random sentences from the middle of the text are included as well. Any thoughts on this approach?