Data prep for a new Portuguese RoBERTa from scratch

I’m an NLP researcher from Brazil, and our team is training RoBERTa base from scratch on a Portuguese dataset of roughly 60 GB. We plan to release it on the HF model hub. Regarding data prep, we have two options for documents longer than 512 tokens (the max length):

  1. Truncate the data point and discard the rest

  2. Break long data points into smaller chunks of 512 tokens (generating new data points)

What are your opinions on these approaches?

My opinion is that you should break long data points into smaller chunks.

One of RoBERTa’s advantages over BERT is simply that it was trained on more data. If you throw away everything after the first 512 tokens of each document, you lose that “more data” advantage.
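
A minimal sketch of that chunking approach, assuming a Hugging Face tokenizer (the `roberta-base` tokenizer here is just a stand-in for your Portuguese one, and `chunk_document` is a name I made up):

```python
from transformers import AutoTokenizer

# Stand-in tokenizer; swap in the one trained on your Portuguese corpus.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

def chunk_document(text, max_len=512):
    """Split one document into ~512-token chunks instead of truncating it."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    body = max_len - 2  # reserve room for <s> and </s>
    for start in range(0, len(ids), body):
        chunk = ids[start:start + body]
        yield [tokenizer.bos_token_id] + chunk + [tokenizer.eos_token_id]
```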

Liu et al. created English RoBERTa using the DOC-SENTENCES or FULL-SENTENCES regimes, either of which uses most of the words in each document, not just the first 512 tokens:

FULL-SENTENCES: each input is packed with full sentences sampled contiguously from one or more documents […] inputs may cross document boundaries.

DOC-SENTENCES: Inputs are constructed similarly to FULL-SENTENCES, except that they may not cross document boundaries.
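
In case it helps, here is my rough reading of DOC-SENTENCES as code. This is not the authors’ implementation, and `sentences` is assumed to be a list of pre-split sentence strings from a single document:

```python
def pack_doc_sentences(sentences, tokenizer, max_len=512):
    """Greedily pack sentences from ONE document into <= max_len-token inputs.

    Inputs never cross the document boundary; call this once per document.
    A single over-long sentence would still need chunking as a fallback.
    """
    inputs, current = [], []
    for sent in sentences:
        ids = tokenizer(sent, add_special_tokens=False)["input_ids"]
        if current and len(current) + len(ids) > max_len - 2:
            inputs.append(current)  # flush the filled input
            current = []
        current.extend(ids)
    if current:
        inputs.append(current)
    return inputs
```

FULL-SENTENCES would be the same loop run over all documents concatenated, so inputs may cross document boundaries.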

I am not an expert, but I’m pretty sure that is correct.

The next two ideas are only speculation:

It might be even better if you could align the start of each chunk with the start of a sentence (but I don’t actually know whether that would make any difference).

Liu et al. used 160 GB of data. Since you have only 60 GB, you might consider sampling your data several times with the splits in different positions. Maybe you could wrap each document into itself (i.e., once you reach the end of the document, if you haven’t hit a 512-token boundary, start again from its beginning).
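
That wrap-around idea as a sketch (again, pure speculation, and `wrap_to_boundary` is a hypothetical helper):

```python
from itertools import cycle, islice

def wrap_to_boundary(ids, max_len=512):
    """Fill a document's last chunk by cycling its own tokens (speculative)."""
    body = max_len - 2  # room left per chunk after <s> and </s>
    remainder = len(ids) % body
    if not ids or remainder == 0:
        return ids
    needed = body - remainder
    # Restart from the document's beginning until the last chunk is full.
    return ids + list(islice(cycle(ids), needed))
```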


Sorry for taking so long to reply. Your answer was very helpful to me and to my team. Thank you very much! We have trained RoBERTa from scratch on a Portuguese corpus and plan to release it to the public eventually.

Are there any good examples of datasets for RoBERTa training available on the Hub?

Could you share the steps you took to create the dataset? I have the same task and need to prepare text data for training RoBERTa from scratch.