Hi, I have a question about BLOOM pretraining.
For the pretraining phase, I prepared a set of unlabeled text in a .txt file. Each line is a paper or a paragraph from a paper. Each line is independent, so the next line is not related to the previous one. The run_clm.py script reads this text line by line, concatenates all the texts from the dataset, and splits the result into blocks using a user-defined block_size param or a default value (which is 1024).
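To make the concatenation concrete, here is a minimal sketch of how I understand the grouping step. It is plain Python with made-up token ids rather than the actual script; the function name and the toy data are only for illustration.

```python
from itertools import chain


def group_texts(tokenized_lines, block_size):
    """Concatenate per-line token ids, then cut the stream into fixed-size blocks."""
    # Flatten every line's token ids into one long stream.
    stream = list(chain.from_iterable(tokenized_lines))
    # Drop the tail that does not fill a whole block.
    total_length = (len(stream) // block_size) * block_size
    return [stream[i:i + block_size] for i in range(0, total_length, block_size)]


# Toy token ids standing in for three unrelated lines of the .txt file.
lines = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]
blocks = group_texts(lines, block_size=4)
print(blocks)  # [[11, 12, 13, 21], [22, 31, 32, 33]]
```

Note that the first block already mixes tokens from line 1 and line 2, which is exactly the situation I am asking about.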
My question is about this concatenation. If each line (or text) in my .txt file describes a different thing (that is, the texts or paragraphs are independent), then the concatenation merges them all without an explicit "end of text" or "end of paper" mark. How does the BLOOM model predict the next token from the previous context in that case? In particular, how can it predict the first token of a new paragraph when the preceding context describes something unrelated?
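For illustration, an explicit end-of-text mark could look roughly like the sketch below: a separator id is appended after every line before concatenating. The `EOS_ID` value is a placeholder I made up; with a real tokenizer it would presumably come from `tokenizer.eos_token_id`. I am not sure whether this is what the script is expected to do.

```python
EOS_ID = 0  # hypothetical end-of-text id, not the real BLOOM value


def concat_with_eos(tokenized_lines, eos_id=EOS_ID):
    """Join per-line token ids into one stream, closing each line with eos_id."""
    stream = []
    for ids in tokenized_lines:
        stream.extend(ids)
        # Mark the boundary so the model can see where one text ends.
        stream.append(eos_id)
    return stream


# Toy token ids standing in for two unrelated lines.
lines = [[11, 12, 13], [21, 22]]
stream = concat_with_eos(lines)
print(stream)  # [11, 12, 13, 0, 21, 22, 0]
```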
I tried to make each block contain only one paragraph or text, but the paragraphs do not all have the same length, so I get an error. On the other hand, if I use the concatenation mechanism, it feels totally wrong to me. Can anyone help me figure this out?
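For reference, my understanding is that one paragraph per block would need every paragraph padded (or truncated) to the same length, roughly as in the sketch below. `PAD_ID` is a made-up placeholder and the function name is hypothetical; the label value -100 is the default index that PyTorch's cross-entropy loss ignores, so padded positions would not contribute to the loss.

```python
PAD_ID = 0           # hypothetical padding id, not the real BLOOM value
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def pad_to_block(ids, block_size, pad_id=PAD_ID):
    """Truncate or right-pad one paragraph's token ids to exactly block_size."""
    ids = ids[:block_size]
    n_pad = block_size - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        # 1 for real tokens, 0 for padding.
        "attention_mask": [1] * len(ids) + [0] * n_pad,
        # Padded positions are excluded from the loss.
        "labels": ids + [IGNORE_INDEX] * n_pad,
    }


# Toy token ids standing in for two paragraphs of different length.
paragraphs = [[11, 12, 13], [21, 22, 23, 24, 25]]
batch = [pad_to_block(p, block_size=4) for p in paragraphs]
print(batch[0]["input_ids"])  # [11, 12, 13, 0]
print(batch[0]["labels"])     # [11, 12, 13, -100]
```

With this, every example has the same length, at the cost of wasting some positions on padding.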