In the Transformers doc on language modeling, the causal language modeling section
says we need to preprocess the data as follows:
- concatenate all the sequences
- split the concatenated sequences into shorter chunks defined by `block_size`, which should be both shorter than the maximum input length and short enough for your GPU RAM.
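For reference, the two steps above can be sketched roughly as follows. This is a minimal pure-Python sketch, not the exact code from the doc: the function name `group_texts`, the toy `batch`, and the tiny `block_size=4` are assumptions for illustration, and I am assuming the leftover tokens at the end are simply dropped.

```python
from itertools import chain


def group_texts(examples, block_size=4):
    # Step 1: concatenate all tokenized sequences of each column into one long list.
    concatenated = {k: list(chain.from_iterable(v)) for k, v in examples.items()}
    total_length = len(concatenated["input_ids"])
    # Assumption: drop the remainder that does not fill a whole block.
    total_length = (total_length // block_size) * block_size
    # Step 2: split the concatenated list into chunks of block_size tokens.
    result = {
        k: [tokens[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, tokens in concatenated.items()
    }
    # For causal language modeling the labels are a copy of the inputs.
    result["labels"] = [chunk[:] for chunk in result["input_ids"]]
    return result


# Toy batch of three already-tokenized "sentences" (made-up token ids).
batch = {"input_ids": [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]]}
chunks = group_texts(batch, block_size=4)
print(chunks["input_ids"])
```

Here the first chunk mixes the first sentence with the start of the second, and the third sentence is dropped entirely, which is the effect I am asking about.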
This kind of processing produces many incomplete sentences, because chunk boundaries can fall in the middle of a sentence. I also noticed that the text used in the examples is mostly documents, such as wiki articles, where this might not have a significant impact. But when the text is in a question-and-answer format, with each question followed by its answer, is this chunking operation still necessary?