Sharing BERT-formatted corpus

First of all, sorry if I am missing something because this is my first post on the forum.
I was thinking that users/teams who train and share BERT-like models could also share the formatted corpus, so that other users can easily reuse it for new models such as ELECTRA, saving time and computation (since the corpus format is the same).

Maybe HF could store and share the formatted corpora, or the users/teams could simply provide them.

The idea comes from seeing several BERT-like models on the Hub for different languages; I would like to bring them to ELECTRA, since it needs fewer resources for pretraining and I (and probably many other users) could handle that.

4 Likes

HuggingFace already provides ~100 NLP datasets in their nlp repository. I think these are mostly evaluation datasets. However, you can add your own dataset, too!
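
For reference, a minimal sketch of browsing and loading those datasets with the nlp library (the "squad" dataset name below is just an example, not one discussed in this thread):

```python
# A minimal sketch, assuming the `nlp` library (since renamed `datasets`).
import nlp

# See how many datasets are available on the hub (~100 at the time of writing).
all_datasets = nlp.list_datasets()
print(len(all_datasets))

# Load one of them; indexing a Dataset returns a plain Python dict.
squad = nlp.load_dataset("squad", split="validation")
print(squad[0])
```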

1 Like

Indeed, we provide the Toronto Book Corpus and Wikipedia (i.e. the datasets used to pretrain BERT), but we are happy to welcome more datasets. We are currently in the process of adding a very large multilingual dataset based on CommonCrawl, called “OSCAR”. You can follow the progress here: https://github.com/huggingface/nlp/pull/348
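
For example, a rough sketch of loading those two corpora with nlp (the Wikipedia config name here is an assumption; check the available configs for your language and dump date):

```python
import nlp

# BookCorpus: a single "train" split of plain-text examples.
books = nlp.load_dataset("bookcorpus", split="train")

# Wikipedia: pick a dump/language config, e.g. an English dump.
wiki = nlp.load_dataset("wikipedia", "20200501.en", split="train")

print(books[0]["text"])
print(wiki[0]["title"])
```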

2 Likes

Great! It would be so cool to have big corpora in several languages in the nlp lib.

Hi @mrm8488.

Seconding your point: for storage space’s sake, I am also thinking we could host different preprocessing scripts. So when you call load_dataset(electra), it would actually run load_dataset(wikipedia) and load_dataset(bookcorpus), pass them through the ELECTRA preprocessing script, and of course cache the result.
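
As a rough, self-contained sketch of that idea (the electra_format function and the tokenizer checkpoint are hypothetical placeholders, not an official ELECTRA preprocessing script):

```python
import nlp
from transformers import ElectraTokenizerFast

tokenizer = ElectraTokenizerFast.from_pretrained("google/electra-small-discriminator")

# The corpora BERT/ELECTRA are usually pretrained on.
wiki = nlp.load_dataset("wikipedia", "20200501.en", split="train")
books = nlp.load_dataset("bookcorpus", split="train")

def electra_format(example):
    # Placeholder formatting: tokenize the raw text into fixed-length inputs;
    # a real script would also handle sentence packing / segment boundaries.
    return tokenizer(example["text"], truncation=True, max_length=128)

# .map() caches its output on disk, so the expensive formatting runs once
# and later calls reuse the cached result.
wiki = wiki.map(electra_format)
books = books.map(electra_format)
```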

Before that becomes a reality, you should definitely check out my work pretraining ELECTRA from scratch and fine-tuning on GLUE, with fastai and HuggingFace/nlp.
repository: Pretrain-MLM-and-finetune-on-GLUE-with-fastai
the latest post of a series: [HuggingFace/nlp] Create fastai Dataloaders, show batch, and create dataset for LM, MLM

I have successfully fine-tuned using the HF model and reproduced the ELECTRA-Small++ results, and I am now pretraining a model from scratch. You can get updates from my Twitter, Richard Wang.

6 Likes

Is the Toronto Book Corpus you provide the original one (used to pretrain BERT, XLNet, GPT, etc.), or is it a recreation? I thought it was no longer provided by the authors, but it would be great if it were publicly available.

Hi @anon7651424,
The source of bookcorpus on the hub is based on this.
As it has no document boundaries and is not official, maybe we should add something like “at your own risk” to the introduction string.

1 Like

Indeed, we should add a disclaimer that no one really knows what was in the TBC used for training BERT.
I’ll add one.

2 Likes