Data-prep for new Portuguese RoBERTa from scratch

I’m an NLP researcher from Brazil and our team is training RoBERTa-base from scratch on a ~60 GB Portuguese dataset. We plan to release it on the HF model hub. Regarding data prep, we have two options for documents longer than 512 tokens (the model’s max length):

  1. Truncate the data point and discard the rest

  2. Break long data points into smaller chunks of 512 tokens (generating new data points)

What are your opinions on these approaches?

My opinion is that you should break long data into smaller chunks.
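Option 2 is simple to implement. A minimal sketch of the idea (`chunk_tokens` is a hypothetical helper, not from any particular library):

```python
def chunk_tokens(token_ids, max_len=512):
    """Split one document's token ids into consecutive chunks of at most
    max_len tokens, so no text is discarded."""
    return [token_ids[i:i + max_len] for i in range(0, len(token_ids), max_len)]

# Example: a 1200-token document becomes chunks of 512, 512 and 176 tokens,
# instead of a single truncated 512-token example.
doc = list(range(1200))
print([len(c) for c in chunk_tokens(doc)])  # [512, 512, 176]
```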

One of the advantages of RoBERTa over BERT is that it was trained on more data. If you throw away everything after the first 512 tokens of each document, you lose that “more data” advantage.

Liu et al. created English RoBERTa using the DOC-SENTENCES or FULL-SENTENCES regimes, either of which uses most of the words in each document, not just the first 512 tokens.

FULL-SENTENCES: each input is packed with full sentences sampled contiguously from one or more documents […] inputs may cross document boundaries.

DOC-SENTENCES: Inputs are constructed similarly to FULL-SENTENCES, except that they may not cross document boundaries.
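The two regimes differ only in whether the packing buffer is flushed at document boundaries. A rough sketch of both, assuming each document is already split into tokenized sentences (both function names are hypothetical):

```python
def pack_full_sentences(docs, max_len=512):
    """FULL-SENTENCES-style packing: fill each input with whole sentences,
    crossing document boundaries; start a new input when the next
    sentence would not fit. `docs` is a list of documents, each a list
    of tokenized sentences (lists of token ids)."""
    inputs, current = [], []
    for doc in docs:
        for sent in doc:
            if current and len(current) + len(sent) > max_len:
                inputs.append(current)
                current = []
            current.extend(sent[:max_len])  # guard against pathological sentences
    if current:
        inputs.append(current)
    return inputs

def pack_doc_sentences(docs, max_len=512):
    """DOC-SENTENCES-style packing: same idea, but each document is
    packed on its own, so inputs never cross document boundaries."""
    inputs = []
    for doc in docs:
        inputs.extend(pack_full_sentences([doc], max_len))
    return inputs
```

With two short documents of one 100-token sentence each, FULL-SENTENCES packs them into a single 200-token input, while DOC-SENTENCES yields two 100-token inputs.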

I am not an expert, but I’m pretty sure that is correct.

The next two ideas are only speculation:

It might be even better if you could align the start of each chunk with the start of a sentence (but I don’t actually know whether that would make any difference).

Liu et al. used 160 GB of data. Since you have only 60 GB, you might consider sampling your data several times with the splits in different positions. Maybe you could even wrap each document into itself (i.e., once you reach the end of a document, if you haven’t hit a 512-token boundary, start again from its beginning).
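One way to read that “wrap each document into itself” idea, purely as a sketch (`wrap_document` is a made-up helper, and whether this actually helps is untested):

```python
def wrap_document(token_ids, max_len=512):
    """Speculative idea: if a document ends before a max_len boundary,
    repeat it from its own start so every chunk is exactly max_len long."""
    if not token_ids:
        return []
    n_chunks = -(-len(token_ids) // max_len)          # ceil division
    need = n_chunks * max_len
    repeated = token_ids * (-(-need // len(token_ids)))  # enough copies
    return [repeated[i * max_len:(i + 1) * max_len] for i in range(n_chunks)]

# A 700-token document yields two full 512-token chunks; the second chunk
# holds the last 188 tokens followed by the document's first 324 tokens.
chunks = wrap_document(list(range(700)))
print([len(c) for c in chunks])  # [512, 512]
```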

Sorry for taking so long to reply. Your answer was very helpful to me and my team. Thank you very much! We have trained RoBERTa from scratch on a Portuguese corpus and plan to release it to the public eventually.