Help on using OpenWebText dataset

Hello everyone,

I’m trying to pre-train GPT-2 from scratch and decided to use the OpenWebText dataset for the task!

However, according to the official site, the dataset should be pre-processed (e.g., filtering out non-English text, removing duplicates, etc.).

So my question is whether the dataset on the Hub is already pre-processed and I can use it as is, or whether I need to do that myself before using it for training.

In addition, if anyone can share an example of using OpenWebText in code or a notebook, I would appreciate it a lot!

Thanks in advance for any help on the topic 🙂

Hi! Yes, this is the version of the dataset we host on the Hub. The only difference is that our script pulls the data from Zenodo, as Zenodo is more reliable than Google Drive.
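On the usage question: below is a minimal sketch of loading OpenWebText with the `datasets` library and chunking the tokenized text into fixed-size blocks, as is standard for causal-LM pre-training. The dataset name `"openwebtext"`, the 1024-token block size, and the `group_texts` helper are illustrative assumptions, not an official recipe; the actual `load_dataset` call is left commented out because the full corpus is tens of gigabytes, so streaming is usually preferable.

```python
def group_texts(token_ids, block_size=1024):
    """Concatenate token ids and split them into fixed-size blocks,
    dropping the ragged tail (a common setup for GPT-2 pre-training).

    This is a hypothetical helper for illustration, not part of the
    `datasets` API."""
    total = (len(token_ids) // block_size) * block_size
    return [token_ids[i : i + block_size] for i in range(0, total, block_size)]


# Sketch of loading OpenWebText from the Hub (assumes the `datasets`
# library is installed; the corpus is large, so stream it rather than
# downloading everything up front):
#
# from datasets import load_dataset
# ds = load_dataset("openwebtext", split="train", streaming=True)
# first = next(iter(ds))  # each example is a dict like {"text": "..."}

# Tiny demonstration of the chunking step on dummy token ids:
blocks = group_texts(list(range(10)), block_size=4)
print(blocks)  # → [[0, 1, 2, 3], [4, 5, 6, 7]]
```

In a real pipeline you would map a GPT-2 tokenizer over the `"text"` field first and feed the resulting id sequences through a chunking step like this before batching.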

Thanks for the quick response @mariosasko !
So just to make sure I understand correctly: the dataset on the Hub is already filtered as specified here, and I can use it as is, right?