Sharing BERT-formatted corpus

First of all, sorry if I am missing something because this is my first post on the forum.
I was thinking that users/teams who train and share BERT-like models could also share the formatted corpus, so that other users can easily reuse it for new models such as ELECTRA, saving time and computation (since the corpus format is the same).

Maybe HF could store and share the formatted corpora, or the users/teams could simply provide them.

The idea comes from seeing several BERT-like models on the Hub for different languages; I would like to bring them to ELECTRA, since it needs fewer resources for pretraining and I (and probably many other users) could handle that.

4 Likes

HuggingFace already provides ~100 NLP datasets in their nlp repository. I think these are mostly evaluation datasets. However, you can add your own dataset, too!
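
For reference, a minimal sketch of browsing and loading those datasets with the nlp library (the "squad" dataset name below is just an example, not one discussed in this thread):

```python
# A minimal sketch, assuming the `nlp` library (since renamed `datasets`).
import nlp

# See how many datasets are available on the hub (~100 at the time of writing).
all_datasets = nlp.list_datasets()
print(len(all_datasets))

# Load one of them; indexing a Dataset returns a plain Python dict.
squad = nlp.load_dataset("squad", split="validation")
print(squad[0])
```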

1 Like

Indeed, we provide the Toronto Book Corpus and Wikipedia (i.e. the datasets used to pretrain BERT), but we are happy to welcome more datasets. We are currently in the process of adding a very large multilingual dataset based on CommonCrawl, called “OSCAR”. You can follow the progress here: https://github.com/huggingface/nlp/pull/348
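
For example, a rough sketch of loading those two corpora with nlp (the Wikipedia config name here is an assumption; check the available configs for your language and dump date):

```python
import nlp

# BookCorpus: a single "train" split of plain-text examples.
books = nlp.load_dataset("bookcorpus", split="train")

# Wikipedia: pick a dump/language config, e.g. an English dump.
wiki = nlp.load_dataset("wikipedia", "20200501.en", split="train")

print(books[0]["text"])
print(wiki[0]["title"])
```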

2 Likes

Great! It would be so cool to have big corpora in several languages in the nlp lib.

Hi @mrm8488.

Seconding your point: for storage space’s sake, I am also thinking we could host different preprocessing scripts. So when you call load_dataset(electra), it would actually run load_dataset(wikipedia) and load_dataset(bookcorpus), pass them through the ELECTRA preprocessing script, and of course cache the result.
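
As a rough, self-contained sketch of that idea (the electra_format function and the tokenizer checkpoint are hypothetical placeholders, not an official ELECTRA preprocessing script):

```python
import nlp
from transformers import ElectraTokenizerFast

tokenizer = ElectraTokenizerFast.from_pretrained("google/electra-small-discriminator")

# The corpora BERT/ELECTRA are usually pretrained on.
wiki = nlp.load_dataset("wikipedia", "20200501.en", split="train")
books = nlp.load_dataset("bookcorpus", split="train")

def electra_format(example):
    # Placeholder formatting: tokenize the raw text into fixed-length inputs;
    # a real script would also handle sentence packing / segment boundaries.
    return tokenizer(example["text"], truncation=True, max_length=128)

# .map() caches its output on disk, so the expensive formatting runs once
# and later calls reuse the cached result.
wiki = wiki.map(electra_format)
books = books.map(electra_format)
```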

Before that becomes a reality, you should definitely check out my work pretraining ELECTRA from scratch and fine-tuning on GLUE, with fastai and HuggingFace/nlp.
repository: Pretrain-MLM-and-finetune-on-GLUE-with-fastai
the latest post of a series: [HuggingFace/nlp] Create fastai Dataloaders, show batch, and create dataset for LM, MLM

I have successfully fine-tuned using the HF model and reproduced the ELECTRA-Small++ results, and I am now pretraining a model from scratch. You can get updates from my Twitter, Richard Wang.

6 Likes

Is the Toronto Book Corpus you provide the original one (used to pretrain BERT, XLNet, GPT, etc.), or is it a recreation? I thought it was no longer provided by the authors, but it would be great if it were publicly available.

Hi @anon7651424,
The source of bookcorpus on the hub is based on this.
As it has no document boundaries and is not official, maybe we should add something like “at your own risk” to the introduction string.

1 Like

Indeed, we should add a disclaimer that no one really knows what was in the TBC used for training BERT.
I’ll add one.

2 Likes