What is the best format to create a dataset in?

I have a large corpus of articles from Wikipedia, news sites, and books. I want to use them to train a GPT-2-style language model from scratch. How should I normalize them? Should I break each article and book into sentences, or should I split them into separate articles and book pages?

Hi,

Usually you just train the model by shuffling the documents and concatenating their text, as explained in Training a causal language model from scratch - Hugging Face NLP Course.
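Here is a minimal sketch of that "shuffle and concatenate" preprocessing, assuming your corpus is a set of plain-text files and you are using the `datasets` library with the GPT-2 tokenizer; the file paths, column names, and `block_size` are illustrative, not something you have to use verbatim.

```python
# Sketch: shuffle documents, tokenize, concatenate, and split into fixed blocks
# for causal LM training. Assumes plain-text files under corpus/ (adjust paths).
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
block_size = 1024  # GPT-2's context length

raw = load_dataset("text", data_files={"train": "corpus/*.txt"})["train"]
raw = raw.shuffle(seed=42)  # shuffle documents before concatenation

def tokenize(examples):
    # Append the EOS token so the model can learn document boundaries.
    return tokenizer([t + tokenizer.eos_token for t in examples["text"]])

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

def group_texts(examples):
    # Concatenate all tokenized documents, then cut into fixed-size blocks,
    # dropping the incomplete remainder at the end.
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_len = (len(concatenated["input_ids"]) // block_size) * block_size
    result = {
        k: [v[i : i + block_size] for i in range(0, total_len, block_size)]
        for k, v in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

lm_dataset = tokenized.map(group_texts, batched=True)
```

With this setup there is no need to split articles or books into sentences; document boundaries are marked by the EOS token and the blocks simply run across them.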

However, research has shown that LLMs improve if you train them on a logical ordering of documents rather than randomly shuffling them: [2310.10638] In-Context Pretraining: Language Modeling Beyond Document Boundaries.


Thank you very much for the response.