Hi @BramVanroy, I posted a similar question in this thread: Help understanding how to build a dataset for language as with the old TextDataset - #7 by nbroad
It seems like `DataCollatorForLanguageModeling` uses the `special_tokens_mask` from tokenization to prevent special tokens from being turned into `[MASK]`; however, the special tokens still get used as model input. Depending on how the texts are chunked into fixed-length sequences, there could be BOS or EOS tokens in the middle of a sequence, as well as no BOS/EOS tokens at the beginning/end of a sequence. I don’t know if that is problematic.
transformers/data_collator.py at 57420b103e2a99aea0f5f80e98216029f7349af2 · huggingface/transformers · GitHub
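To make the concern concrete, here is a minimal sketch of what I mean, assuming a BERT-style tokenizer (`bert-base-uncased`) and the usual "concatenate then chunk" preprocessing; the `block_size` of 16 is arbitrary, just small enough to show the effect:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

# Tokenize several texts and concatenate them, as TextDataset-style
# preprocessing does.
texts = ["First short document.", "Second short document.", "Third one."]
encodings = tokenizer(texts, return_special_tokens_mask=True)
all_ids = sum(encodings["input_ids"], [])
all_special = sum(encodings["special_tokens_mask"], [])

# Chunk into fixed-size blocks: [CLS]/[SEP] can now land mid-block, and a
# block need not start with [CLS] or end with [SEP].
block_size = 16
features = [
    {
        "input_ids": all_ids[i : i + block_size],
        "special_tokens_mask": all_special[i : i + block_size],
    }
    for i in range(0, len(all_ids) - block_size + 1, block_size)
]

batch = collator(features)
# Special-token positions are never selected for [MASK] (their label stays
# -100), but the tokens themselves are still fed to the model as inputs.
print(batch["input_ids"])
print(batch["labels"])
```

If you print the batch, you can see mid-sequence `[SEP]`/`[CLS]` tokens kept in `input_ids` with label `-100`, which is exactly the situation I’m unsure about.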