The project idea is identical to the one for pretraining RoBERTa in Spanish, but uses the Swedish dataset instead.
The idea is to use the Swedish portion of mC4 (which amounts to roughly 100GB of uncompressed text) to pre-train a RoBERTa-large model, first at a sequence length of 256 and then at 512. It might be a good idea to start at 128, as suggested here.
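The sequence-length curriculum above (start short, finish at full length) boils down to how the tokenized corpus is packed into fixed-size training examples. A minimal sketch of that packing step, with hypothetical names of our own (not from any particular library) and integers standing in for token ids:

```python
# Sketch of packing for a fixed-length pre-training curriculum:
# concatenate tokenized documents and slice them into equal chunks,
# first at 128 tokens, later at 256/512. Names here are illustrative.

def pack_into_chunks(token_streams, seq_len):
    """Concatenate token-id lists and split into full seq_len chunks,
    dropping the trailing remainder (standard MLM packing)."""
    flat = [tok for doc in token_streams for tok in doc]
    n_chunks = len(flat) // seq_len
    return [flat[i * seq_len:(i + 1) * seq_len] for i in range(n_chunks)]

# Toy "tokenized documents" totalling 1200 tokens.
docs = [list(range(300)), list(range(500)), list(range(400))]

short_chunks = pack_into_chunks(docs, 128)  # early training phase
long_chunks = pack_into_chunks(docs, 512)   # final phase at full length

print(len(short_chunks), len(long_chunks))  # → 9 2
```

The same corpus yields far more examples per epoch at 128 than at 512, which is part of why the short-sequence warm-up phase is cheap.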
The model will be trained in Swedish.
The Swedish portion of mC4, about 100GB of uncompressed data.
There are already Flax scripts to pre-train RoBERTa that we can easily use.
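For reference, the transformers repository ships a Flax masked-language-modeling example (`run_mlm_flax.py` under `examples/flax/language-modeling/`). A hedged sketch of a launch command for the first 128-length phase; the flag values are assumptions and should be checked against the script's actual arguments:

```shell
# Illustrative invocation of the Flax MLM example script; paths and
# hyperparameters below are placeholders, not tested settings.
python run_mlm_flax.py \
    --model_type roberta \
    --config_name ./swedish-roberta-config \
    --tokenizer_name ./swedish-roberta-tokenizer \
    --dataset_name mc4 \
    --dataset_config_name sv \
    --max_seq_length 128 \
    --per_device_train_batch_size 64 \
    --learning_rate 3e-4 \
    --output_dir ./roberta-large-swedish
```

Later phases would rerun with `--max_seq_length 256` and then `512`, resuming from the previous checkpoint.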
Will the data be enough to train a good model?
A well-performing monolingual Swedish model on the usual benchmarks.
- https://arxiv.org/pdf/1907.11692.pdf
- https://arxiv.org/pdf/1911.00359.pdf
- https://www.aclweb.org/anthology/W11-2123.pdf