Help understanding how to build a dataset for language as with the old TextDataset

Hi @lhoestq , I know this is an old thread, but I have a follow-up question. If you tokenize and then group as suggested above, some bos and eos tokens will end up in the middle of the input_ids sequence. For example, if the max length is 128 and you combine a 100-token sequence with the first 28 tokens of the following sequence, then element 100 (with 0-indexing) will be the bos token of that second sequence.
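
For concreteness, this is roughly the grouping step I mean (adapted from the grouping used in the Transformers language-modeling examples); `block_size`, `raw_dataset`, and `tokenize_function` are just placeholders here:

```python
from itertools import chain

block_size = 128  # max length used when chunking

def group_texts(examples):
    # Concatenate all tokenized sequences, then split into fixed-size blocks.
    # Sequence boundaries (eos/bos pairs) land wherever they fall inside a block.
    concatenated = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the small remainder that doesn't fill a full block
    total_length = (total_length // block_size) * block_size
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }

# tokenized = raw_dataset.map(tokenize_function, batched=True, remove_columns=["text"])
# lm_dataset = tokenized.map(group_texts, batched=True)
```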

Is this problematic? Does MLM require a bos token at the beginning or an eos token at the end? Does it need any special tokens at all?