Best practice for MLM: full text or break into sentences?

jonathanalis · November 18, 2021, 6:35pm

Hello.
I am following this tutorial on fine-tuning a language model (both causal and masked): Google Colab

However, the way the data is presented to the model is bugging me.
Here, the text is joined together into blocks of 128 tokens. The only restriction is that the tokens are contiguous. But the blocks can split sentences, run across sentence and document boundaries, and start in the middle of a sentence.
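
To make the concern concrete, here is a minimal sketch of that kind of grouping step. It is written from my reading of the tutorial, not copied from it: the name `group_texts`, the toy integer "token ids", and the document lengths are assumptions for illustration.

```python
from itertools import chain

def group_texts(tokenized_docs, block_size=128):
    """Concatenate token-id lists and cut them into fixed-size blocks."""
    # Join all documents into one long stream, ignoring document boundaries.
    stream = list(chain.from_iterable(tokenized_docs))
    # Drop the tail that does not fill a whole block.
    total = (len(stream) // block_size) * block_size
    return [stream[i:i + block_size] for i in range(0, total, block_size)]

# Toy "documents": integers stand in for token ids (hypothetical data).
docs = [list(range(0, 100)), list(range(100, 250)), list(range(250, 300))]
blocks = group_texts(docs, block_size=128)

print(len(blocks), [len(b) for b in blocks])
# The first block holds all of document 1 plus the start of document 2.
print(blocks[0][99], blocks[0][100])
```

In this toy case the first block ends partway through the second document, and the second block starts in the middle of it, which is exactly the behaviour I am asking about.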

Is that really good practice? It seems this could lose semantics, couldn't it?
BERT breaks its inputs into pairs of sentences with well-defined beginnings and ends, plus separator tokens, so the model receives better-structured data. Why would fine-tuning a language model on new text be any different?
Is there support for this in the literature?

Another related question: the example code breaks the text into blocks of 128 tokens. However, the maximum input size for this model is 512. Is there any advantage to using this smaller size? Intuitively, larger blocks contain more contiguous text (fewer sentence breaks). Wouldn't that be better for the model?
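
A rough way to quantify that intuition is to count how many blocks start somewhere other than the first token of a document. This is only a sketch: the helper `count_mid_document_starts` and the document lengths are made up for illustration, and the count says nothing about model quality.

```python
def count_mid_document_starts(doc_lengths, block_size):
    """Return (number of blocks, blocks that do not start at a document start)."""
    # Offsets in the concatenated stream at which each document begins.
    starts = set()
    offset = 0
    for n in doc_lengths:
        starts.add(offset)
        offset += n
    # Only full blocks are kept; the incomplete tail is dropped.
    n_blocks = offset // block_size
    mid = sum(1 for b in range(n_blocks) if b * block_size not in starts)
    return n_blocks, mid

doc_lengths = [300, 212, 512, 1024]  # hypothetical token counts per document
for size in (128, 512):
    print(size, count_mid_document_starts(doc_lengths, size))
# 128 (16, 13)
# 512 (4, 1)
```

With these made-up lengths, 13 of the 16 blocks of 128 tokens start mid-document, against 1 of the 4 blocks of 512 tokens.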

Thank you in advance.
