Help on using OpenWebText dataset

Hello everyone,

I’m trying to pre-train GPT-2 from scratch and decided to use the OpenWebText dataset for the task!

However, according to the official site, the dataset should be pre-processed (e.g., filtering out non-English text, removing duplicates, etc.).

So my question is whether the dataset on the Hub is already pre-processed and I can use it as is, or whether I need to do that myself before using it for training.

In addition, if anyone can share an example of using OpenWebText in code or a notebook, I would appreciate it a lot!

Thanks in advance for any help on the topic 🙂

Hi! Yes, this is the version of the dataset we host on the Hub. The only difference is that our script pulls the data from Zenodo, as Zenodo is more reliable than Google Drive.
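On the usage question: below is a minimal sketch of loading OpenWebText with the `datasets` library and chunking the tokenized text into fixed-size blocks, as is standard for causal-LM pre-training. The dataset name `"openwebtext"`, the 1024-token block size, and the `group_texts` helper are illustrative assumptions, not an official recipe; the actual `load_dataset` call is left commented out because the full corpus is tens of gigabytes, so streaming is usually preferable.

```python
def group_texts(token_ids, block_size=1024):
    """Concatenate token ids and split them into fixed-size blocks,
    dropping the ragged tail (a common setup for GPT-2 pre-training).

    This is a hypothetical helper for illustration, not part of the
    `datasets` API."""
    total = (len(token_ids) // block_size) * block_size
    return [token_ids[i : i + block_size] for i in range(0, total, block_size)]


# Sketch of loading OpenWebText from the Hub (assumes the `datasets`
# library is installed; the corpus is large, so stream it rather than
# downloading everything up front):
#
# from datasets import load_dataset
# ds = load_dataset("openwebtext", split="train", streaming=True)
# first = next(iter(ds))  # each example is a dict like {"text": "..."}

# Tiny demonstration of the chunking step on dummy token ids:
blocks = group_texts(list(range(10)), block_size=4)
print(blocks)  # → [[0, 1, 2, 3], [4, 5, 6, 7]]
```

In a real pipeline you would map a GPT-2 tokenizer over the `"text"` field first and feed the resulting id sequences through a chunking step like this before batching.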

Thanks for the quick response @mariosasko !
So just to make sure I understand correctly: the dataset on the Hub is already filtered as specified here, and I can use it as is, right?