Text format for language modeling

Hi @BramVanroy, I asked a similar question in this thread: Help understanding how to build a dataset for language as with the old TextDataset - #7 by nbroad

It seems like DataCollatorForLanguageModeling uses the special_tokens_mask from tokenization to prevent special tokens from being turned into [MASK]; however, the special tokens themselves still appear in the input. Depending on how the sequences are chunked, there can be BOS or EOS tokens in the middle of a sequence, and no BOS/EOS tokens at its beginning or end (see the sketch after the link). I don’t know whether that is problematic.
transformers/data_collator.py at 57420b103e2a99aea0f5f80e98216029f7349af2 · huggingface/transformers · GitHub
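Here is a minimal sketch of what I mean, assuming a BERT-style tokenizer; the model name, example texts, and chunk size are arbitrary choices, and the chunking mimics the group_texts recipe from the run_mlm.py example:

```python
from itertools import chain

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

texts = ["First short document.", "Second short document.", "Third one."]

# Tokenize each text with special tokens, then concatenate everything and
# re-chunk into fixed-size blocks, as the group_texts recipe does. The
# [CLS]/[SEP] from the original documents can land anywhere in a chunk.
encodings = tokenizer(texts, return_special_tokens_mask=True)
concatenated = {k: list(chain.from_iterable(v)) for k, v in encodings.items()}
chunk_size = 8  # arbitrary, just small enough to show the effect
total = len(concatenated["input_ids"]) // chunk_size * chunk_size
chunks = [
    {k: v[i : i + chunk_size] for k, v in concatenated.items()}
    for i in range(0, total, chunk_size)
]

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
batch = collator(chunks)

# Special tokens are still present in input_ids, but because the collator
# zeroes their masking probability via special_tokens_mask, they are never
# replaced by [MASK] and their labels stay -100 (ignored by the loss).
for ids, labels in zip(batch["input_ids"], batch["labels"]):
    print(tokenizer.convert_ids_to_tokens(ids.tolist()), labels.tolist())
```

With these toy texts, both issues show up in the printed chunks: there is a [SEP] [CLS] document boundary in the middle of a chunk, and the second chunk starts without [CLS] and ends without [SEP], while the labels at all special-token positions remain -100.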