RoBERTa Swedish
The project idea is identical to the one for pretraining RoBERTa in Spanish, but using the Swedish dataset instead.
The idea is to use the Swedish portion of mC4 (which roughly amounts to 100GB of uncompressed text) to pre-train a RoBERTa-large model, first at a sequence length of 256 and then at 512. It might also be a good idea to start at 128, as suggested here.
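In the usual MLM data pipeline, tokenized documents are concatenated and cut into fixed-size blocks, so moving from 128 to 256 to 512 tokens between stages is mostly a matter of changing the block size and re-running the mapping step. The helper below is an illustrative re-implementation of that grouping step, not the exact code from the example script:

```python
from itertools import chain

def group_texts(examples, block_size):
    """Concatenate tokenized documents and cut them into blocks of
    exactly `block_size` tokens (the sequence length of the current stage)."""
    concatenated = list(chain(*examples["input_ids"]))
    # Drop the trailing remainder so every block is full length.
    total_length = (len(concatenated) // block_size) * block_size
    return {
        "input_ids": [
            concatenated[i : i + block_size]
            for i in range(0, total_length, block_size)
        ]
    }

# Stage the sequence length between training runs, e.g. 128 -> 256 -> 512:
# tokenized.map(lambda batch: group_texts(batch, 128), batched=True)
# tokenized.map(lambda batch: group_texts(batch, 512), batched=True)
```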
2. Language
The model will be trained in Swedish.
3. Model
RoBERTa-large
4. Datasets
The Swedish portion of mC4, roughly 100GB of uncompressed text.
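The Swedish split of mC4 is available on the Hugging Face Hub and, given its size, is probably most convenient to load in streaming mode so the full 100GB never has to sit on disk or in memory at once. A minimal sketch, assuming the `mc4` dataset identifier with the `sv` config (worth double-checking on the Hub):

```python
from datasets import load_dataset

# Stream the Swedish portion of mC4; examples are yielded lazily.
# Note: "mc4" / "sv" are the assumed dataset identifier and config name.
dataset = load_dataset("mc4", "sv", split="train", streaming=True)

# Peek at a few documents to sanity-check the text field.
for i, example in enumerate(dataset):
    print(example["text"][:200])
    if i >= 2:
        break
```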
5. Training scripts
There are already Flax scripts to pre-train RoBERTa that we can easily use:
https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling
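Before launching the Flax MLM example, a Swedish byte-level BPE tokenizer and a RoBERTa-large configuration need to be saved to the directory the script will read from. A rough sketch of that preparation step; the output path, vocabulary size, and subsample size are illustrative assumptions, not values from the example script:

```python
import os
from itertools import islice

from datasets import load_dataset
from tokenizers import ByteLevelBPETokenizer
from transformers import RobertaConfig

model_dir = "./roberta-large-swedish"  # illustrative output directory
os.makedirs(model_dir, exist_ok=True)

# Train a byte-level BPE tokenizer on a subsample of the Swedish mC4 text.
raw = load_dataset("mc4", "sv", split="train", streaming=True)
tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    (example["text"] for example in islice(raw, 500_000)),  # subsample for speed
    vocab_size=50_265,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save(f"{model_dir}/tokenizer.json")

# Reuse the roberta-large architecture, but train it from scratch on Swedish.
config = RobertaConfig.from_pretrained("roberta-large", vocab_size=50_265)
config.save_pretrained(model_dir)
```

With the tokenizer and config in place, the `run_mlm_flax.py` script from the linked examples directory can be pointed at this directory and at the dataset above.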
6. Challenges
Will 100GB of data be enough to train a good model?
7. Desired project outcome
A monolingual Swedish model that performs well on the usual benchmarks.
8. Reads
- https://arxiv.org/pdf/1907.11692.pdf
- https://arxiv.org/pdf/1911.00359.pdf
- https://www.aclweb.org/anthology/W11-2123.pdf