Why split sequences into shorter chunks when pretraining an LLM?

ChloeYang · August 16, 2023, 3:36am

The Transformers documentation on causal language modeling says the data needs to be preprocessed as follows:

concatenate all the sequences
split the concatenated sequences into shorter chunks defined by block_size, which should be both shorter than the maximum input length and short enough for your GPU RAM.
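
For reference, the two steps above can be sketched in plain Python. This is a minimal sketch modeled on the group_texts helper from the Transformers language modeling examples, not a copy of it: the block_size argument, the toy token IDs, and the batch layout (a dict of lists, as a batched tokenizer call would return) are assumptions made for illustration.

```python
def group_texts(examples, block_size):
    """Concatenate tokenized sequences, then cut them into block_size chunks."""
    # Step 1: concatenate all sequences of each field into one long list.
    concatenated = {key: sum(seqs, []) for key, seqs in examples.items()}
    total_length = len(concatenated["input_ids"])

    # Drop the trailing remainder so every chunk has exactly block_size tokens.
    # (If there are fewer than block_size tokens in total, keep them as one short chunk.)
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size

    # Step 2: split the concatenated list into consecutive chunks.
    result = {
        key: [tokens[i : i + block_size] for i in range(0, total_length, block_size)]
        for key, tokens in concatenated.items()
    }

    # For causal language modeling the labels are a copy of the inputs;
    # the model shifts them internally when computing the loss.
    result["labels"] = [chunk[:] for chunk in result["input_ids"]]
    return result


# Toy batch: three "tokenized" sequences with made-up token IDs.
batch = {"input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9]]}
chunks = group_texts(batch, block_size=4)
print(chunks["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

In this toy run the chunk boundaries ignore the original sequence boundaries: the second sequence is split across two chunks, the third one starts mid-chunk, and its last token (9) is dropped as the remainder.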

This kind of processing will generate many incomplete sentences. I also noticed that the text used in the examples is mostly long-form documents, such as wiki articles, where cutting at arbitrary points might not have a significant impact. But when the text is in a question-and-answer format, with each question followed by its answer, is this chunking operation still necessary?
