I am following this tutorial on fine-tuning a language model (both causal and masked): Google Colab
However, the way the data is presented to the model is bugging me.
Here, the texts are joined together and split into blocks of 128 tokens. The only restriction is that the tokens within a block are contiguous. But the blocks can break sentences, run across sentence and document boundaries, and start in the middle of a sentence.
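To make concrete what I mean, here is a minimal sketch of the grouping as I understand it. This is my own simplification, not the tutorial's exact code: the function name `group_texts`, the toy token ids, and the tiny block size are all just for illustration.

```python
from itertools import chain


def group_texts(tokenized_docs, block_size=128):
    """Concatenate tokenized documents and cut the stream into fixed-size blocks.

    Illustrative helper (hypothetical name); it ignores sentence and
    document boundaries entirely, which is the behaviour in question.
    """
    # Join every document into one long stream of token ids.
    stream = list(chain.from_iterable(tokenized_docs))
    # Drop the tail that does not fill a whole block.
    total = (len(stream) // block_size) * block_size
    return [stream[i:i + block_size] for i in range(0, total, block_size)]


# Three toy "documents" of token ids (lengths 5, 7 and 4).
docs = [list(range(0, 5)), list(range(100, 107)), list(range(200, 204))]
blocks = group_texts(docs, block_size=4)
for block in blocks:
    print(block)
```

With these toy inputs, the second block contains the last token of the first document followed by the first tokens of the second one, so a single training example mixes two unrelated documents.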
Is that really good practice? It seems that it can lose the semantics, doesn't it?
BERT breaks the inputs into pairs of sentences with well-defined beginnings and ends, and also with separator tokens, so the model receives better-structured data. Why would fine-tuning a language model on new text be any different?
Is there support for this in the literature?
Another related question: the example code breaks the text into blocks of 128 tokens. However, the maximum size for this model is 512. Is there any advantage to using the smaller size? Intuitively, larger blocks contain more contiguous text (fewer sentence breaks). Wouldn't that be better for the model?
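A rough way to check this intuition is to count how many block boundaries fall in the middle of a sentence for each block size. The sketch below is only a toy under stated assumptions: the helper name `count_split_sentences` is made up, and the uniform 30-token sentences are invented data, not anything taken from the tutorial.

```python
from itertools import accumulate


def count_split_sentences(sentence_lengths, block_size):
    """Count block boundaries that land inside a sentence.

    sentence_lengths: number of tokens in each sentence, in stream order.
    Hypothetical helper for illustration only.
    """
    # Cumulative token positions where each sentence ends.
    sentence_ends = set(accumulate(sentence_lengths))
    total = sum(sentence_lengths)
    # Interior positions where one block ends and the next begins.
    cuts = range(block_size, total, block_size)
    # A cut that does not coincide with a sentence end splits a sentence.
    return sum(1 for cut in cuts if cut not in sentence_ends)


# Invented corpus: 100 sentences of 30 tokens each (3000 tokens).
lengths = [30] * 100
splits_128 = count_split_sentences(lengths, 128)
splits_512 = count_split_sentences(lengths, 512)
print(splits_128, splits_512)
```

On this toy stream the 512-token blocks cut through fewer sentences than the 128-token blocks, which is what I would expect in general, although it says nothing about whether the model actually trains better.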
Thank you in advance.