I have a large corpus of articles from Wikipedia, news sites, and books. I want to use them to create a language model from scratch using GPT-2. How can I normalize them? Should I break each article and book into sentences, or should I split them into separate articles and book pages?
Hi,
Usually one just trains the model by shuffling documents and concatenating their text, as explained in Training a causal language model from scratch - Hugging Face NLP Course.
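As a rough illustration, here is a minimal sketch of that "shuffle, concatenate, chunk" packing step with the GPT-2 tokenizer. The `documents` list and the `pack_documents` helper are placeholder names for this example, not something from the course itself:

```python
# Minimal sketch: shuffle whole documents, concatenate their token ids
# (separated by the EOS token), and split into fixed-length blocks.
import random
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
block_size = 1024  # GPT-2's context length

def pack_documents(documents, tokenizer, block_size):
    # Shuffle at the document level, not sentences or pages.
    documents = documents[:]
    random.shuffle(documents)

    # Concatenate token ids, marking document boundaries with EOS.
    ids = []
    for doc in documents:
        ids.extend(tokenizer.encode(doc))
        ids.append(tokenizer.eos_token_id)

    # Cut the long token stream into fixed-length training blocks,
    # dropping the incomplete remainder at the end.
    n_blocks = len(ids) // block_size
    return [ids[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]

blocks = pack_documents(["First article text...", "Second article text..."],
                        tokenizer, block_size)
```

So there is no need to normalize into sentences or pages; documents are just concatenated and chunked to the model's context length.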
However, research has shown that LLMs improve if you train them on a logical ordering of documents rather than randomly shuffling them: [2310.10638] In-Context Pretraining: Language Modeling Beyond Document Boundaries.
Thank you very much for the response.