Question about Bloom pretrain

zeyuteng · April 21, 2023, 8:21am

Hi, I have a question about BLOOM pretraining.

For the pretraining phase, I prepared a set of unlabeled text in a .txt file. Each line is a paper or a paragraph from a paper. The lines are independent of each other, so the next line is not related to the previous one. The run_clm.py script reads the text line by line, concatenates all texts from the dataset, and splits the result into blocks of block_size tokens, where block_size is either set by the user or left at its default value (which is 1024).

My question is about the concatenation. If each line in my .txt file describes a different thing (that is, the texts or paragraphs are independent), then the concatenation merges them all without an explicit "end of text" or "end of paper" marker. How does the BLOOM model predict the next token from the previous context in that case? In particular, how can the model predict the first token of a new paragraph when the preceding context describes something unrelated?
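To make the boundary issue concrete, here is a minimal sketch in plain Python. It assumes the grouping step simply concatenates token ids and cuts them into blocks of block_size, dropping the leftover tail. The helper `pack_into_blocks`, the toy token ids, and `EOS_ID = 2` are hypothetical and not taken from run_clm.py. Appending an end-of-sequence id to each text before concatenation is one common way to mark boundaries; whether it fits a given setup is an assumption to verify.

```python
from typing import List, Optional


def pack_into_blocks(
    docs: List[List[int]],
    block_size: int,
    eos_id: Optional[int] = None,
) -> List[List[int]]:
    """Concatenate tokenized texts and cut the stream into fixed-size blocks.

    If eos_id is given, it is appended to every text first, so each
    boundary between independent texts is marked inside the packed stream.
    """
    stream: List[int] = []
    for ids in docs:
        stream.extend(ids)
        if eos_id is not None:
            stream.append(eos_id)  # explicit "end of text" marker

    # Drop the tail that does not fill a whole block.
    usable = (len(stream) // block_size) * block_size
    return [stream[i : i + block_size] for i in range(0, usable, block_size)]


# Toy token ids standing in for three independent lines of the .txt file.
docs = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]
EOS_ID = 2  # hypothetical id; with a real tokenizer, read tokenizer.eos_token_id

plain_blocks = pack_into_blocks(docs, block_size=4)
marked_blocks = pack_into_blocks(docs, block_size=4, eos_id=EOS_ID)

print(plain_blocks)   # [[11, 12, 13, 21], [22, 31, 32, 33]]
print(marked_blocks)  # [[11, 12, 13, 2], [21, 22, 2, 31], [32, 33, 34, 2]]
```

In the first output, token 21 (the start of the second text) directly follows token 13 with nothing in between. In the second, the marker id 2 sits at every boundary, so the model can at least learn that whatever follows it starts a new, unrelated text.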

I tried to make each block contain only one paragraph or text, but the texts do not have the same length and I get an error. If I use the concatenation mechanism, it feels totally wrong to me. Can anyone help me figure this out?
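For the one-text-per-block attempt, a length error is what one would typically expect when examples of different lengths are stacked into a single batch (an assumption, since the error message is not shown here). A minimal sketch of the usual workaround, truncating or padding every text to block_size, is below. The helper `pad_to_block`, the toy ids, and `PAD_ID = 3` are hypothetical. The label value -100 is the default ignore index of PyTorch's cross-entropy loss, so padded positions do not contribute to the loss.

```python
from typing import Dict, List

IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss


def pad_to_block(ids: List[int], block_size: int, pad_id: int) -> Dict[str, List[int]]:
    """Truncate or right-pad one tokenized text to exactly block_size tokens."""
    ids = ids[:block_size]
    n_pad = block_size - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": [1] * len(ids) + [0] * n_pad,
        # Padded positions get IGNORE_INDEX so they are excluded from the loss.
        "labels": ids + [IGNORE_INDEX] * n_pad,
    }


PAD_ID = 3  # hypothetical id; with a real tokenizer, read tokenizer.pad_token_id
docs = [[11, 12, 13], [21, 22], [31, 32, 33, 34, 35]]
batch = [pad_to_block(ids, block_size=4, pad_id=PAD_ID) for ids in docs]

for example in batch:
    print(example)
```

Every example now has the same length, so the batch can be stacked into a tensor. The trade-off is that short texts waste compute on padding, which is one reason packing by concatenation is often preferred for pretraining.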
