Text format for language modeling

Hi @BramVanroy, I asked a similar question in this thread: Help understanding how to build a dataset for language as with the old TextDataset - #7 by nbroad

It seems like DataCollatorForLanguageModeling uses the special_tokens_mask from tokenization to prevent special tokens from being turned into [MASK]; however, the special tokens themselves still appear in the input. Depending on how the sequences are chunked, there can be BOS or EOS tokens in the middle of a sequence, and no BOS/EOS tokens at its beginning or end (see the sketch after the link). I don’t know whether that is problematic.
transformers/data_collator.py at 57420b103e2a99aea0f5f80e98216029f7349af2 · huggingface/transformers · GitHub
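Here is a minimal sketch of what I mean, assuming a BERT-style tokenizer; the model name, example texts, and chunk size are arbitrary choices, and the chunking mimics the group_texts recipe from the run_mlm.py example:

```python
from itertools import chain

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

texts = ["First short document.", "Second short document.", "Third one."]

# Tokenize each text with special tokens, then concatenate everything and
# re-chunk into fixed-size blocks, as the group_texts recipe does. The
# [CLS]/[SEP] from the original documents can land anywhere in a chunk.
encodings = tokenizer(texts, return_special_tokens_mask=True)
concatenated = {k: list(chain.from_iterable(v)) for k, v in encodings.items()}
chunk_size = 8  # arbitrary, just small enough to show the effect
total = len(concatenated["input_ids"]) // chunk_size * chunk_size
chunks = [
    {k: v[i : i + chunk_size] for k, v in concatenated.items()}
    for i in range(0, total, chunk_size)
]

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
batch = collator(chunks)

# Special tokens are still present in input_ids, but because the collator
# zeroes their masking probability via special_tokens_mask, they are never
# replaced by [MASK] and their labels stay -100 (ignored by the loss).
for ids, labels in zip(batch["input_ids"], batch["labels"]):
    print(tokenizer.convert_ids_to_tokens(ids.tolist()), labels.tolist())
```

With these toy texts, both issues show up in the printed chunks: there is a [SEP] [CLS] document boundary in the middle of a chunk, and the second chunk starts without [CLS] and ends without [SEP], while the labels at all special-token positions remain -100.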