Help understanding how to build a dataset for language as with the old TextDataset

Hi @lhoestq , I know this is an old thread, but I have a follow-up question. If you tokenize and then group as suggested above, some bos and eos tokens will end up in the middle of the input_ids sequence. For example, if the max length is 128 and you combine a 100-token sequence with the first 28 tokens of the following sequence, then element 100 (with 0-indexing) will be the bos token of that second sequence.
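
For concreteness, this is roughly the grouping step I mean (adapted from the grouping used in the Transformers language-modeling examples); `block_size`, `raw_dataset`, and `tokenize_function` are just placeholders here:

```python
from itertools import chain

block_size = 128  # max length used when chunking

def group_texts(examples):
    # Concatenate all tokenized sequences, then split into fixed-size blocks.
    # Sequence boundaries (eos/bos pairs) land wherever they fall inside a block.
    concatenated = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the small remainder that doesn't fill a full block
    total_length = (total_length // block_size) * block_size
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }

# tokenized = raw_dataset.map(tokenize_function, batched=True, remove_columns=["text"])
# lm_dataset = tokenized.map(group_texts, batched=True)
```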

Is this problematic? Does MLM require a bos token at the beginning or an eos token at the end? Does it need any special tokens at all?