Text format for language modeling

Hello

I cannot seem to find this information anywhere, but perhaps that is because the search terms are quite general.

I am wondering what the input data format should look like for language modeling. I am particularly interested in causal language modeling (CLM), but I’d also like to know for masked language modeling (MLM).

My intuition tells me that one should:

  • split the dataset into sentences
  • (optionally?) do linguistic tokenization (split on whitespace)
  • insert “beginning of sentence” and “end of sentence” tokens
  • merge the sentences again
  • chunk the text into blocks of max_seq_len tokens for the model (you could even use a sliding window of size max_seq_len)
  • so you just have a text file where every paragraph is one (potentially huge) line (see the sketch below)
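
To make this concrete, here is a minimal sketch of the pipeline I have in mind. The sentence splitter, the `<s>`/`</s>` token strings, and the block size are just placeholders I picked for illustration, not anything from the example scripts:

```python
# Minimal sketch of the intended preprocessing: split into sentences, wrap
# each sentence in BOS/EOS tokens, merge everything back into one stream,
# then chunk it into fixed-size blocks.
from typing import List

BOS, EOS = "<s>", "</s>"   # placeholder special tokens
MAX_SEQ_LEN = 128          # placeholder block size (in whitespace tokens)

def preprocess(paragraph: str) -> List[List[str]]:
    # Naive sentence split on ". " just for illustration; a real pipeline
    # would use a proper sentence splitter.
    sentences = [s.strip() for s in paragraph.split(". ") if s.strip()]

    # Linguistic tokenization by whitespace, with BOS/EOS around each sentence.
    tokens: List[str] = []
    for sent in sentences:
        tokens += [BOS] + sent.split() + [EOS]

    # Chunk the merged token stream into blocks of MAX_SEQ_LEN.
    return [tokens[i:i + MAX_SEQ_LEN] for i in range(0, len(tokens), MAX_SEQ_LEN)]

if __name__ == "__main__":
    for block in preprocess("First sentence. Second sentence. Third sentence."):
        print(block)
```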

This way, the model should be able to generalize better, as it learns arbitrary start and end positions for sentences. That being said, I do not know whether the position embeddings have a negative impact on this process, since they are no longer “correct”. (After chunking, the first token of a block may not be the first token of a sentence.)

Looking at the LM examples, it seems that such steps are not taken automatically, except for chunking. So is my intuition wrong? If so, how exactly should a given text file look?

As a side question: it is not clear to me from the example run_clm script what exactly is meant by "We drop the small remainder". If a text file is given with potentially very long lines (one line per paragraph), does that mean that everything in a line exceeding the block size is discarded? If so, there must be a better way to organise your text file than the one I illustrated above.
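
For reference, my reading of the grouping step in run_clm is roughly the following (simplified sketch, not copied verbatim from the script, so I may be misreading it):

```python
# Rough sketch of what I understand the grouping step in run_clm to do:
# concatenate all tokenized texts, then cut the stream into full blocks.
def group_texts(examples, block_size):
    # Concatenate all tokenized texts into one long list per key.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # "We drop the small remainder": whatever is left after the last
    # full block of block_size tokens is thrown away.
    total_length = (total_length // block_size) * block_size
    # Split the stream into consecutive blocks of block_size.
    return {
        k: [t[i:i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
```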

EDIT: I found this useful documentation by HF concerning a strided sliding window, which is exactly what I intended above. It seems that this is not implemented in the examples, however. I wonder why.
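
For what it’s worth, what I have in mind is something like the tokenizer’s overflow/stride mechanism, roughly as below (my understanding of the API; a fast tokenizer and the model/parameter values are assumptions for illustration):

```python
# Sketch of the strided sliding window I have in mind, using the tokenizer's
# overflowing-tokens mechanism to produce overlapping windows over one text.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model

text = "A very long paragraph that does not fit into a single block ..."
encodings = tokenizer(
    text,
    max_length=32,                   # block size
    stride=8,                        # overlap between consecutive windows
    truncation=True,
    return_overflowing_tokens=True,  # return the overflowing chunks as extra rows
)

# Each entry in input_ids is one (overlapping) window over the same text.
for ids in encodings["input_ids"]:
    print(tokenizer.decode(ids))
```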
