What is the best format to create a dataset in?

I have a large corpus of articles from Wikipedia, news sites, and books. I want to use them to train a GPT-2-style language model from scratch. How should I normalize them? Should I break each article and book into sentences, or should I split them into separate articles and book pages?

Hi,

Usually you just train the model by shuffling the documents and concatenating their text, as explained in Training a causal language model from scratch - Hugging Face NLP Course.
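Here is a minimal sketch of that "shuffle and concatenate" preprocessing, assuming your corpus is a set of plain-text files and you are using the `datasets` library with the GPT-2 tokenizer; the file paths, column names, and `block_size` are illustrative, not something you have to use verbatim.

```python
# Sketch: shuffle documents, tokenize, concatenate, and split into fixed blocks
# for causal LM training. Assumes plain-text files under corpus/ (adjust paths).
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
block_size = 1024  # GPT-2's context length

raw = load_dataset("text", data_files={"train": "corpus/*.txt"})["train"]
raw = raw.shuffle(seed=42)  # shuffle documents before concatenation

def tokenize(examples):
    # Append the EOS token so the model can learn document boundaries.
    return tokenizer([t + tokenizer.eos_token for t in examples["text"]])

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

def group_texts(examples):
    # Concatenate all tokenized documents, then cut into fixed-size blocks,
    # dropping the incomplete remainder at the end.
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_len = (len(concatenated["input_ids"]) // block_size) * block_size
    result = {
        k: [v[i : i + block_size] for i in range(0, total_len, block_size)]
        for k, v in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

lm_dataset = tokenized.map(group_texts, batched=True)
```

With this setup there is no need to split articles or books into sentences; document boundaries are marked by the EOS token and the blocks simply run across them.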

However, research has shown that LLMs improve if you train them on a logical ordering of documents rather than randomly shuffling them: [2310.10638] In-Context Pretraining: Language Modeling Beyond Document Boundaries.


Thank you very much for the response.