The idea is to use the Spanish portion of mC4 (which amounts to roughly 1TB of uncompressed text) to pre-train a RoBERTa-large model, first with a sequence length of 256 and then 512. It might be a good idea to start at 128, as suggested here.
The model will be trained in Spanish (regardless of variety).
The Spanish portion of mC4, about 1TB of uncompressed data.
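As a rough sketch of how to work with a corpus this size, assuming the Hugging Face `datasets` library and its `mc4` dataset with the `es` configuration, the data can be streamed without downloading or decompressing the full 1TB locally:

```python
from datasets import load_dataset

# Stream the Spanish split of mC4 so the full ~1TB corpus never has to be
# materialized on disk (dataset and config names are assumptions).
mc4_es = load_dataset("mc4", "es", split="train", streaming=True)

# Peek at a few documents to sanity-check the text field.
for i, example in enumerate(mc4_es):
    print(example["text"][:200])
    if i >= 2:
        break
```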
There are already Flax scripts to pre-train RoBERTa that we can easily use.
This is too much data; we need a way to reduce it in order to finish on time. Options:
- Random sampling.
- Perplexity sampling using percentiles and a Spanish language model. One option here is to use a 5-gram Kneser-Ney model as implemented in the KenLM library (Heafield, 2011) and released by Facebook (see the sketch after this list).
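A minimal sketch of the perplexity-sampling option, assuming the `kenlm` Python bindings and a pre-trained Spanish 5-gram model file (the `es.arpa.bin` path and the 25th/75th percentile bounds are illustrative assumptions, not final choices):

```python
import kenlm
import numpy as np

# Pre-trained Spanish 5-gram Kneser-Ney model (path is an assumption; the
# CCNet release by Facebook provides such models).
model = kenlm.Model("es.arpa.bin")

def perplexity(text: str) -> float:
    """Document-level perplexity from KenLM's log10 score."""
    words = text.split()
    if not words:
        return float("inf")
    log10_score = model.score(text, bos=True, eos=True)
    # +1 accounts for the end-of-sentence token scored by KenLM.
    return 10.0 ** (-log10_score / (len(words) + 1))

def percentile_filter(docs, lower=25, upper=75):
    """Keep documents whose perplexity falls between two percentiles,
    estimated from the documents themselves."""
    ppls = np.array([perplexity(d) for d in docs])
    lo, hi = np.percentile(ppls, [lower, upper])
    return [d for d, p in zip(docs, ppls) if lo <= p <= hi]
```

In practice the percentile thresholds would be estimated once on a random sample of the corpus and then applied on the fly while streaming, rather than scoring every document twice.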
A monolingual Spanish model that performs well on the usual benchmarks.
- Liu et al. (2019), RoBERTa: A Robustly Optimized BERT Pretraining Approach: https://arxiv.org/pdf/1907.11692.pdf
- Wenzek et al. (2019), CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data: https://arxiv.org/pdf/1911.00359.pdf
- Heafield (2011), KenLM: Faster and Smaller Language Model Queries: https://www.aclweb.org/anthology/W11-2123.pdf