I cannot seem to find this information anywhere, perhaps because the search terms are quite general.
I am wondering how the input data format has to look for language modeling. I am particularly interested in CLM but I’d also like to know for MLM.
My intuition tells me that one should:
- split dataset into sentences
- ? do linguistic tokenization (split by whitespace) ?
- insert “begin of sentence” and “end of sentence” tokens
- merge the sentences again
- chunk the text up into blocks of `max_seq_len` length for the model (you could even use a sliding window)
- so you just have a text file and every paragraph is one (potentially huge) line
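The steps above could be sketched roughly like this (a hypothetical illustration only: the sentence splitter, the special-token strings, and `prepare_blocks` itself are my own stand-ins, not anything from the examples; a real setup would use the model's tokenizer and its actual BOS/EOS tokens):

```python
# Sketch of the preprocessing I have in mind. The sentence splitting and
# whitespace "tokenization" here are deliberately naive placeholders.
def prepare_blocks(paragraphs, tokenize, bos="<s>", eos="</s>", max_seq_len=8):
    tokens = []
    for para in paragraphs:
        for sent in para.split(". "):  # naive sentence splitting
            # wrap every sentence in begin/end-of-sentence tokens, then merge
            tokens += [bos] + tokenize(sent) + [eos]
    # chunk the merged token stream into fixed-size blocks
    return [tokens[i:i + max_seq_len] for i in range(0, len(tokens), max_seq_len)]

blocks = prepare_blocks(
    ["First sentence. Second sentence"],
    tokenize=str.split,  # whitespace split as a stand-in for real tokenization
)
# → [['<s>', 'First', 'sentence', '</s>', '<s>', 'Second', 'sentence', '</s>']]
```

With this scheme a block boundary can fall mid-sentence, which is exactly the part I am unsure about.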
This way, the model should be able to generalize better, as it learns arbitrary start and end positions for sentences. That being said, I do not know whether the position embeddings have a negative impact on this process, as they are no longer “correct”. (After chunking, the first token of a block may not be the first token of a sentence.)
Looking at the LM examples, it seems that such steps are not taken automatically, except for chunking. So is my intuition wrong? If so, how exactly should a given text file look?
As a side question: it is not clear to me from the example run_clm script what exactly is meant by “we drop the small remainder”. If a text file is given with potentially very long lines (one line per paragraph), does that mean that everything exceeding the block size in that line is discarded? If so, there must be a better way to organise your text file than I illustrated above.
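My current reading of the grouping step is roughly the following, heavily simplified (this is my interpretation, not the script itself): all tokenized lines are concatenated first, and only the tail that does not fill a whole block is dropped, rather than everything beyond the block size in each line.

```python
# Simplified sketch of how I understand the example's grouping logic:
# concatenate everything, then keep only whole blocks.
def group_texts(token_lists, block_size):
    concatenated = [t for tokens in token_lists for t in tokens]
    # drop the small remainder that does not fill a full block
    total = (len(concatenated) // block_size) * block_size
    return [concatenated[i:i + block_size] for i in range(0, total, block_size)]

chunks = group_texts([[1, 2, 3, 4, 5], [6, 7, 8]], block_size=3)
# → [[1, 2, 3], [4, 5, 6]]  (only the trailing [7, 8] is dropped)
```

If that reading is right, at most `block_size - 1` tokens are lost per dataset (or per map batch), not per line, but I would like confirmation.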
EDIT: I found this useful documentation by HF concerning a strided sliding window, which is exactly what I intended above. It seems that this is not implemented in the examples, however. I wonder why.
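For concreteness, the strided sliding window I mean is the following (a minimal pure-Python sketch; in practice the documentation I linked does this via the tokenizer's `return_overflowing_tokens` and `stride` options, and the function name here is my own):

```python
# Strided sliding window over a token stream: consecutive blocks overlap by
# `stride` tokens, so no tokens are dropped and boundary context is shared.
def sliding_window(tokens, block_size, stride):
    step = block_size - stride  # how far the window advances each time
    return [tokens[i:i + block_size]
            for i in range(0, max(len(tokens) - stride, 1), step)]

windows = sliding_window(list(range(10)), block_size=4, stride=2)
# → [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

Compared with plain chunking, this trades extra training examples (and repeated tokens) for coverage of every boundary position.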