I’m trying to pre-train GPT-2 from scratch, and I decided to use the OpenWebText dataset for the task!
However, according to the official site, the dataset should be pre-processed (e.g., filtering out non-English text, removing duplicates, etc.).
So my question is: is the dataset on the Hub already pre-processed so I can use it as is, or do I need to do that preprocessing myself before using it for training?
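In case I do have to preprocess it myself, here is a minimal sketch of what I had in mind for the deduplication step (just exact-match dedup via hashing; I know the original pipeline may have used fuzzier methods like MinHash, so this is only an assumption on my part):

```python
import hashlib

def dedup(texts):
    """Drop exact duplicates by hashing each document's normalized text."""
    seen = set()
    unique = []
    for text in texts:
        # Normalize lightly (strip whitespace, lowercase) before hashing,
        # so trivially different copies are treated as duplicates.
        digest = hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique

docs = ["Hello world.", "hello world.  ", "Another document."]
print(dedup(docs))  # the near-identical second copy is dropped
```

Would something along these lines be enough, or is more aggressive near-duplicate detection expected?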
In addition, if anyone can share an example of using OpenWebText in code or a notebook, I would appreciate it a lot!
Thanks in advance for any help on the topic!