Pretrain RoBERTa-large from scratch in Swedish

RoBERTa Swedish

The project idea is identical to the one for Pretraining RoBERTa in Spanish, but using the Swedish dataset instead.

The idea is to use the Swedish portion of mC4 (which amounts to roughly 100 GB of uncompressed text) to pre-train a RoBERTa-large model, first with a sequence length of 256 and then 512. It might be a good idea to start at 128, as suggested here. A sketch of the re-chunking this implies is shown below.
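For illustration, a rough sketch of how the same tokenized corpus can be re-chunked into blocks of 128, 256 and then 512 tokens for the different phases, loosely following the `group_texts` helper used in the Flax MLM example. The tokenization step and the `tokenized_dataset` name are assumptions, not part of this proposal:

```python
from itertools import chain

def group_texts(examples, max_seq_length):
    """Concatenate tokenized examples and split them into blocks of max_seq_length."""
    concatenated = list(chain(*examples["input_ids"]))
    # Drop the remainder so every block has exactly max_seq_length tokens.
    total_length = (len(concatenated) // max_seq_length) * max_seq_length
    return {
        "input_ids": [
            concatenated[i : i + max_seq_length]
            for i in range(0, total_length, max_seq_length)
        ]
    }

# The same map call can be re-run with 128, 256 and then 512 for each phase, e.g.:
# tokenized_dataset.map(lambda ex: group_texts(ex, 128), batched=True)
```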

2. Language

The model will be trained in Swedish.

3. Model

RoBERTa-large
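For reference, a minimal sketch of instantiating a randomly initialised RoBERTa-large in Flax. The hyperparameters are the standard roberta-large values; the `vocab_size` is a placeholder that has to match whatever Swedish tokenizer we end up training:

```python
from transformers import RobertaConfig, FlaxRobertaForMaskedLM

config = RobertaConfig(
    vocab_size=50265,            # placeholder, depends on the Swedish tokenizer
    hidden_size=1024,
    num_hidden_layers=24,
    num_attention_heads=16,
    intermediate_size=4096,
    max_position_embeddings=514,
)

# Randomly initialised model, ready for pre-training with the Flax MLM script.
model = FlaxRobertaForMaskedLM(config, seed=0)
```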

4. Datasets

Swedish portion of mC4, about 100 GB of uncompressed data.
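A minimal sketch of streaming the Swedish split of mC4 so the ~100 GB never has to be fully downloaded to disk, assuming the `mc4` dataset with the `sv` config on the Hugging Face Hub:

```python
from datasets import load_dataset

# Streaming avoids materialising the full Swedish split locally.
swedish_mc4 = load_dataset("mc4", "sv", split="train", streaming=True)

# Peek at a few documents to sanity-check the text field.
for example in swedish_mc4.take(3):
    print(example["text"][:100])
```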

5. Training scripts

There are already Flax scripts to pre-train RoBERTa that we can easily use:

https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling
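Since we are training from scratch, a Swedish tokenizer also has to be trained before launching the MLM script. A rough sketch using the `tokenizers` library; the vocabulary size, number of batches and output path are placeholders:

```python
from datasets import load_dataset
from tokenizers import ByteLevelBPETokenizer

dataset = load_dataset("mc4", "sv", split="train", streaming=True)

def batch_iterator(batch_size=1000, num_batches=10_000):
    """Yield batches of raw Swedish text from the streamed dataset."""
    batch = []
    for example in dataset:
        batch.append(example["text"])
        if len(batch) == batch_size:
            yield batch
            batch = []
            num_batches -= 1
            if num_batches == 0:
                return

# Byte-level BPE with RoBERTa-style special tokens.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    batch_iterator(),
    vocab_size=50265,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save("tokenizer-sv.json")
```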

6. Challenges

Will the data be enough to train a good model?

7. Desired project outcome

A monolingual Swedish model that performs well on the usual benchmarks.

8. Reads

- https://arxiv.org/pdf/1907.11692.pdf
- https://arxiv.org/pdf/1911.00359.pdf
- https://www.aclweb.org/anthology/W11-2123.pdf

Just a suggestion: How about we pretrain a common Scandinavian model instead, as our languages are so similar? I’ve made a project suggestion here: Scandinavian RoBERTa.

Officially defining this one as well 🙂