Text format for language modeling


I cannot seem to find this information anywhere, perhaps because my search terms are quite general.

I am wondering how the input data format has to look for language modeling. I am particularly interested in CLM but I’d also like to know for MLM.

My intuition tells me that one should:

  • split dataset into sentences
  • (optionally?) do linguistic pre-tokenization (split by whitespace)
  • insert “beginning of sentence” and “end of sentence” tokens
  • merge the sentences again
  • chunk the text into blocks of max_seq_len tokens for the model (you could even use a sliding window of size max_seq_len)
  • so you just have a text file and every paragraph is one (potentially huge) line
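To make the steps above concrete, here is a rough sketch (assumptions: the sentences are already split, `tokenize` is any function mapping a sentence to token ids, and the BOS/EOS ids are placeholders; a real setup would take these from the tokenizer):

```python
# Sketch of the pipeline above: wrap sentences in BOS/EOS, merge into one
# token stream, then chunk into fixed-size blocks for the model.
def make_blocks(sentences, tokenize, bos_id, eos_id, max_seq_len):
    stream = []
    for sent in sentences:
        stream += [bos_id] + tokenize(sent) + [eos_id]
    # Keep only full blocks; any partial block at the end is dropped.
    n_full = len(stream) // max_seq_len * max_seq_len
    return [stream[i:i + max_seq_len] for i in range(0, n_full, max_seq_len)]
```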

This way, the model should be able to generalize better as it learns arbitrary start and ending positions for sentences. That being said, I do not know whether the position embeddings have a negative impact on this process as they are not “correct” any more. (For chunked sentences, the first word may not be the first word of the sentence.)

Looking at the LM examples, it seems that such steps are not taken automatically, except for chunking. So is my intuition wrong? If so, how exactly should a given text file look?

As a side question: it is not clear to me from the example run_clm script what exactly is meant by “We drop the small remainder”. If a text file is given with potentially very long lines (one line per paragraph), does that mean that everything exceeding the block size in that line is discarded? If so, there must be a better way to organise your text file than the one I illustrated above.
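From reading the script, the grouping step seems to look roughly like this (toy sketch with made-up integer token ids; the real `group_texts` in run_clm works on batched dicts). Notably, the lines are all concatenated first, so only the final partial block of the whole stream is dropped, not the tail of each line:

```python
# Toy sketch of the grouping step in run_clm: concatenate all tokenized
# lines into one stream, then split into full blocks of block_size.
def group_texts(token_lines, block_size):
    stream = [tok for line in token_lines for tok in line]
    total = len(stream) // block_size * block_size  # "drop the small remainder"
    return [stream[i:i + block_size] for i in range(0, total, block_size)]

blocks = group_texts([[1, 2, 3], [4, 5, 6, 7]], block_size=3)
# two full blocks; only the single leftover token (7) is dropped
```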

EDIT: I found this useful documentation by HF concerning a strided sliding window, which is exactly what I intended above. It seems that this is not implemented in the examples, however. I wonder why.
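For clarity, the strided sliding window I mean is something like this (toy sketch over plain integer ids; in practice you would presumably use the tokenizer’s `stride` / `return_overflowing_tokens` options that the HF documentation describes):

```python
# Toy sketch of a strided sliding window: overlapping windows of
# max_seq_len tokens, advancing by `stride` tokens each time.
def sliding_windows(tokens, max_seq_len, stride):
    last_start = len(tokens) - max_seq_len
    return [tokens[i:i + max_seq_len] for i in range(0, last_start + 1, stride)]

windows = sliding_windows(list(range(8)), max_seq_len=4, stride=2)
# overlapping windows: [0..3], [2..5], [4..7]
```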


hey @BramVanroy you should check the new TF notebooks about that. i think they explain the steps involved very clearly: notebooks/language_modeling-tf.ipynb at new_tf_notebooks · huggingface/notebooks · GitHub

Thanks for the reply. I did find this, but it leaves a lot of questions unanswered:

  • Should the text be pretokenized? (from the example they show, it seems so, but GPT-2 was trained to treat spaces as part of tokens, so this may not be ideal)
  • They do not use a sliding window
  • BOS/EOS are only inserted based on the original data, so sometimes EOS appears after a title, sometimes after a paragraph. Wouldn’t it make more sense to sentence-segment the data first, i.e. insert EOS after every meaningful sentence?

Hi @BramVanroy, I was posting a similar question in this thread here: Help understanding how to build a dataset for language as with the old TextDataset - #7 by nbroad

It seems like DataCollatorForLanguageModeling will prevent special tokens from being turned into [MASK] by using the special_tokens_mask from tokenization; however, the special tokens still get used as input. Depending on the sequence length, there could be BOS or EOS tokens in the middle of the sequence, as well as no BOS/EOS tokens at the beginning/end of the sequence. I don’t know if that is problematic.
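Roughly, the masking step looks like this (simplified sketch, not the actual collator code — among other things, the real DataCollatorForLanguageModeling also replaces some masked positions with random tokens or keeps them unchanged instead of always using [MASK]):

```python
import random

# Simplified sketch of MLM masking with a special_tokens_mask: special
# tokens are never masked, but they remain in the input sequence.
def mlm_mask(input_ids, special_tokens_mask, mask_token_id, mlm_probability=0.15):
    masked, labels = [], []
    for tok, is_special in zip(input_ids, special_tokens_mask):
        if not is_special and random.random() < mlm_probability:
            masked.append(mask_token_id)  # position the model must predict
            labels.append(tok)            # the original token to recover
        else:
            masked.append(tok)            # left as-is (incl. BOS/EOS)
            labels.append(-100)           # ignored by the loss
    return masked, labels
```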
transformers/data_collator.py at 57420b103e2a99aea0f5f80e98216029f7349af2 · huggingface/transformers · GitHub

[MASK] is not relevant for causal LM, so this is not very relevant for me.

While I may not have realized that you were talking about CLM, I think there is still relevant information here. The example scripts for both MLM and CLM tokenize and then group by max_seq_length, which means some sequences will have BOS/EOS tokens in the middle, and others will have no BOS/EOS tokens at the beginning/end.

I’m wondering if this is problematic, and it seems like you are wondering about the impact of BOS/EOS tokens too.
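A tiny example of what the grouping does to the special tokens (made-up ids: 0 = BOS, 2 = EOS):

```python
# Three BOS/EOS-wrapped sentences, concatenated and split into blocks of 5.
sentences = [[0, 11, 12, 2], [0, 21, 22, 2], [0, 31, 2]]
stream = [tok for sent in sentences for tok in sent]
block_size = 5
blocks = [stream[i:i + block_size]
          for i in range(0, len(stream) // block_size * block_size, block_size)]
# first block ends on a dangling BOS; second block starts mid-sentence,
# with no BOS at the start and EOS/BOS stuck in the middle
```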